



Journal of Ambient Intelligence and Humanized Computing https://doi.org/10.1007/s12652-019-01199-0

ORIGINAL RESEARCH

CNNs hard voting for multi-focus image fusion

Mostafa Amin-Naji1 · Ali Aghagolzadeh1 · Mehdi Ezoji1

1 Faculty of Electrical and Computer Engineering, Babol Noshirvani University of Technology, Babol, Iran

* Corresponding author: Ali Aghagolzadeh

Received: 31 July 2018 / Accepted: 8 January 2019 © Springer-Verlag GmbH Germany, part of Springer Nature 2019

Abstract

The main idea of image fusion is to gather the necessary features and information from several images into a single image. Multi-focus image fusion gathers this information from the focused areas of multiple images, and the ideal fused image contains all of the focused parts of the input images. Many multi-focus image fusion methods have been studied in the spatial and transform domains. Recently, multi-focus image fusion methods based on deep learning have emerged and have greatly enhanced the decision map. Nevertheless, constructing an ideal initial decision map is still difficult, so the previous methods depend heavily on extensive post-processing algorithms. This paper proposes a new convolution neural network (CNN) architecture based on ensemble learning for multi-focus image fusion. The network uses hard voting of three CNN branches, each trained on a different dataset. Using several models and datasets instead of just one is more reliable and helps the network improve its classification accuracy. In addition, this paper introduces a new, simple arrangement of the patches in the multi-focus datasets that is very useful for obtaining better classification accuracy. With this new arrangement, three multi-focus datasets are created with the help of the gradients in the vertical and horizontal directions. This paper illustrates that the initial segmented decision map of the proposed method is much cleaner than those of the other methods, and is even cleaner than their final decision maps refined with many post-processing algorithms. The experimental results and analysis validate that the proposed network has the cleanest initial decision map and the best quality of the output fused image compared with the other state-of-the-art methods. These comparisons are performed with various qualitative and quantitative assessments based on several fusion metrics to demonstrate the superiority of the proposed network.

Keywords Multi-focus · Image fusion · Hard voting · Ensemble learning · Deep learning · Convolution neural networks

1 Introduction

The main idea of image fusion is to gather the important and essential information from the input images into one image that ideally contains all of this information (Amin-Naji and Aghagolzadeh 2018; Stathaki 2011; Li et al. 2017). There are various types of image fusion, and one of them is multi-focus image fusion. Multi-focus image fusion gathers all of the focused parts of the input images into a single fused image. The multi-focus problem arises from the limited depth of focus of the optical lenses in cameras (Amin-Naji and Aghagolzadeh 2018), and it has become a strong incentive for researchers to solve it.

There have been many studies on image fusion (e.g., multi-focus image fusion) in recent years, which are divided into the spatial and transform domain categories. These methods fuse the input images with region-, block-, and pixel-based approaches (Amin-Naji and Aghagolzadeh 2018; Stathaki 2011; Li et al. 2017; Liu et al. 2018a; Ma et al. 2019; Ghassemian 2016; James and Dasarathy 2014; Smith and Heather 2005; Haghighat et al. 2011; Nejati et al. 2015). The popularly used transforms are the discrete cosine transform (DCT) and the multi-scale transform (MST). The DCT based methods are suitable for real-time applications but suffer from blocking artifacts due to block processing in the DCT domain (Amin-Naji and Aghagolzadeh 2018, 2017a, b; Haghighat et al. 2011; Amin-Naji et al. 2017; Naji and Aghagolzadeh 2015a, b; Phamila and Amutha 2014; Cao et al. 2015). On the other side, the multi-scale transform (MST) based methods are very common (Petrovic and Xydeas 2004; Li et al. 1995; Zhang and Guo 2009; Kumar 2013; Bavirisetti and Dhuli 2016; Dogra et al. 2017; Liu et al. 2018b; Zhou et al. 2014). These methods decompose the input images, fuse the essential information, and finally reconstruct the fused image. They are usually associated with ringing artifacts in the edge areas (Amin-Naji and Aghagolzadeh 2018; Haghighat et al. 2011).

The oldest multi-focus methods are the spatial domain methods, among which pixel-based methods have recently been appreciated by researchers (Li et al. 2006, 2013a, b, 2017; Liu et al. 2015, 2018a; Ma et al. 2019; Ghassemian 2016; Huang and Jing 2007; Wu et al. 2013; Nejati et al. 2017; Liang et al. 2012; Pertuz et al. 2013; Yang et al. 2017; Kumar 2015; Guo et al. 2015; Zhang et al. 2017). These methods estimate and construct the fused image directly from the intensity values of the input images, and their main idea is to estimate the ideal decision map for the fusion process. Since deep learning (DL) has gained significant success in image processing and computer vision applications, convolution neural network (CNN) based fusion methods have greatly enhanced the decision map of the fusion process (Liu et al. 2017, 2018a; LeCun et al. 2015; Goodfellow et al. 2016; Du and Gao 2017, 2018; Tang et al. 2018; Xu et al. 2018; Guo et al. 2018). Liu et al. (2017) were the first to use deep convolution neural networks in image fusion; their method used a Siamese network to distinguish between focused and unfocused blocks. In Du and Gao (2017), image segmentation-based multi-focus image fusion through a multi-scale convolution neural network (MSCNN) was introduced, which segments the focused and unfocused regions. In addition, the same authors introduced an all convolutional neural network (ACNN) that simply replaced max pooling with strided convolution layers (Du and Gao 2018). In Tang et al. (2018), the pixel-wise convolution neural network (p-CNN) was introduced for classification of the focused and unfocused areas. Their contributions are a new type of dataset and a good approach for accelerating the construction of the final fused image. Their datasets are generated from the tiny 32 × 32 images of the Cifar-10 dataset, and they considered several focus conditions with 12 geometric types of 32 × 32 masks. Nevertheless, we think that this laborious work of creating a new dataset is not needed for the task of multi-focus image fusion, and these datasets are generated from tiny 32 × 32 images. All of these CNN based multi-focus image fusion techniques have greatly enhanced the decision map.

However, their initial segmented decision maps still have many errors, so they need a lot of post-processing such as morphological operations (opening and closing), guided filters, watershed, consistency verification (CV), and small region removal on the initial segmented decision map to construct a satisfactory final fusion decision map (Zhou et al. 2014; Liu et al. 2015, 2017; Guo et al. 2015, 2018; Zhang et al. 2017; Li et al. 2013a; Du and Gao 2017; Tang et al. 2018). Therefore, a large share of the good performance of their final decision maps is due to the extensive post-processing algorithms, which is a separate issue from the proposed CNN networks themselves. Guo et al. (2018) introduced a multi-focus image fusion method based on a fully convolutional network (FCN). This method uses a very deep FCN containing 18 convolution layers and three deconvolution layers to construct the initial decision map. The network has 267,506 parameters that must be learned during training, which is a huge number of parameters compared with the other fusion networks. The method also requires tedious work to create a multi-focus dataset for the FCN: it utilizes the PASCAL VOC 2012 classification and segmentation dataset, requires the segmentation results to construct the multi-focus dataset, and does not cover all natural scenes. Besides this, its initial decision map is inappropriate and not better than those of the other methods, so another algorithm is required to improve it. Hence, the authors utilized the fully connected conditional random field (CRF) (Krähenbühl and Koltun 2011) as post-processing, which is a method for multi-class image segmentation. So, a large share of the good performance of their final decision map is due to the CRF segmentation, which is a separate issue from their FCN based network.

In this paper, a new CNN architecture is proposed to construct a better initial segmented decision map than the other methods. The proposed architecture uses ensemble learning of CNNs: it is the hard voting of three CNNs that are trained on three different datasets. Using an ensemble of CNNs is a more reasonable idea than using just one single CNN or dataset. In addition, the proposed method introduces a new type of multi-focus dataset. It simply changes the arrangement of the patches of the multi-focus datasets to obtain better accuracy than the popular multi-focus datasets. The initial segmented decision map of the proposed method is similar to or even better than the final decision maps of the other methods, which are obtained after extensive post-processing.

This paper is organized as follows: In Sect. 2, the proposed hard voting network of CNNs is explained in detail. The experiments and analysis are presented in Sect. 3, and finally conclusions are presented in Sect. 4.


2 Proposed method

2.1 The ensemble of convolution neural networks

The convolution neural network (CNN or ConvNet) is a popular deep learning network. CNNs are a special category of artificial neural networks (ANNs) designed for representing and processing data (e.g., images) with a grid-like structure. The CNN architecture is based on parameter sharing and sparse interactions, which makes it very suitable for efficiently learning spatial invariance in images (Goodfellow et al. 2016; LeCun et al. 1998; Deep Learning from Wikipedia; CNN from Wikipedia). In general, a CNN architecture contains four kinds of layers: the convolution layer (conv), the rectified linear unit (ReLU), the pooling layer (subsampling), and the fully connected layer (FC).

Ensemble learning in artificial neural networks defines a learning paradigm in which several networks are trained simultaneously on one or several datasets. It is more reasonable and reliable to use different datasets and models instead of a single model or dataset, and ensemble learning shows improved generalization capability compared with single networks and datasets (Zhou et al. 2002; Dietterich 2000; Opitz and Maclin 1999; Maji et al. 2016). Ensemble learning methods can also be used in deep learning, and there are various types of ensemble learning in machine learning (Opitz and Maclin 1999). One of the simplest ensemble learning methods is hard voting of the predictions of separately trained models on different datasets. It is expected that the hard voting of several CNN models trained on several datasets helps to achieve higher classification accuracy and capability.
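As a concrete illustration of the hard-voting idea described above, the following minimal PyTorch sketch (written for this text, not the authors' code) takes the class predictions of three independently trained models and combines them by majority vote; the model and input names are placeholders.

```python
import torch

def hard_vote(models, inputs):
    """Majority vote of the class predictions of several trained models.

    models: iterable of torch.nn.Module classifiers (here: three CNN branches)
    inputs: list of input batches, one per model (each branch may expect a
            differently pre-processed view of the same patches)
    """
    votes = []
    with torch.no_grad():
        for model, x in zip(models, inputs):
            logits = model(x)                   # shape: (batch, num_classes)
            votes.append(logits.argmax(dim=1))  # predicted label per sample
    votes = torch.stack(votes, dim=0)           # shape: (num_models, batch)
    # Majority label per sample; with three voters and two classes ties cannot occur
    return votes.mode(dim=0).values
```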

2.2 Hard voting of CNNs for multi‑focus image fusion

In this paper, multi-focus image fusion and the process of creating the decision map are viewed as a classification problem, in which the focus and activity level measurement of image fusion is considered as feature extraction. Therefore, CNNs can be utilized in multi-focus image fusion. The use of CNNs differs from conventional classifiers for image fusion, which need features extracted manually (by humans) in the spatial and transform domains. A CNN tries to obtain a hierarchical feature representation of the images with different levels of abstraction: the convolution layers act as the feature extraction part of the fusion process, and the FC layers at the end of the network act as the classification part. Image fusion methods based on CNNs have overcome the difficulty of manually designing a complex focus criterion and activity level measurement for each fusion rule. CNN based fusion methods learn both the activity level measurement and the fusion rule together during training, and they have greatly increased the quality of the fusion results and the accuracy of the focus criterion compared with conventional fusion methods.

The proposed method extends ensemble learning into the hard voting of three CNN models, a simple form of ensemble learning in which the models are trained on different datasets. This idea helps the network greatly in achieving higher classification accuracy, and it is therefore very beneficial for obtaining a better initial segmented decision map for multi-focus image fusion. If the input multi-focus images are unregistered color images, they should be registered and transformed into grayscale images for constructing the decision map. For simplicity, two images A and B are considered: a part of image A is focused while the same part of image B is unfocused. The focused part has higher contrast and more information, so it has more pronounced and evident edges. Following the popular multi-focus image fusion methods, the input multi-focus images are divided into 16 × 16 blocks. Assume that PA is a patch of a focused area in image A and PB is the corresponding patch of the unfocused area in image B. If a CNN is trained on a dataset of separate PA and PB patches, the classification accuracy of PA and PB is poor; that is why other CNN fusion methods use a Siamese network to distinguish between patches and achieve better accuracy. Our proposed method instead introduces a new, simple type of dataset: it simply changes the arrangement of the dataset to obtain better accuracy than the other types of multi-focus image datasets. It creates a new 32 × 16 patch by concatenating PA and PB as shown in Fig. 1. This new arrangement of patches helps the network obtain better accuracy, and the proposed network is trained on datasets with this arrangement. As mentioned, this paper designs a new network with hard voting of three CNNs trained on three different datasets; a network gives better results when it gets advice from several networks trained on different datasets. The proposed method therefore prepares three efficient datasets for learning, as described next.

Fig. 1 New arrangement of the focused and unfocused parts of the input images in the proposed method


One dataset consists of the original input images, and the other two datasets are the gradients of the input images in the vertical and horizontal directions, since the gradient is very suitable for measuring the amount of edges in an image. The original, Gx, and Gy versions of one sample of the datasets are shown in Fig. 2a–c, respectively. In this paper, more than 300 high quality images are selected randomly from the COCO 2014 (Microsoft) dataset. To create focus conditions similar to real multi-focus images captured with a real camera, each image is passed through four different Gaussian filters with a standard deviation of 9 and sizes of 9 × 9, 11 × 11, 13 × 13, and 15 × 15. Therefore, there are five versions (the original image and the four blurred versions) of each selected COCO image. Then, the gradients in the horizontal (Gx) and vertical (Gy) directions are computed for each of these five versions; in other words, there are five versions of each image in the original, Gx, and Gy datasets. In the next step, each version in these three datasets is divided into 16 × 16 blocks. The blocks of the original images in these three datasets are considered focused, and the corresponding blocks of their four blurred versions are considered unfocused. Each block of every original image in these three datasets is concatenated into a 32 × 16 patch with the corresponding block of a blurred version of that image, according to the proposed arrangement of the focused and unfocused parts in Fig. 1. The 32 × 16 patches whose upper part is focused and those whose lower part is focused are assigned the labels 0 and 1, respectively. With this procedure, 2,300,000 labeled patches are generated for each of the original, Gx, and Gy datasets, of which 2,000,000 patches are used for training and 300,000 patches for testing.
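The dataset construction described above can be sketched as follows. This is an illustrative outline, not the authors' released code: the blur sizes, the 16 × 16 block size, and the vertical concatenation of focused/unfocused blocks follow the text, while the use of OpenCV's GaussianBlur and Sobel operators and the exact labeling of both stacking orders are assumptions.

```python
import cv2
import numpy as np

BLUR_SIZES = (9, 11, 13, 15)   # Gaussian filter sizes from the paper, sigma = 9

def stacked_patches(sharp, blurred):
    """Cut 16x16 blocks and stack sharp over blurred (and vice versa) into 32x16 patches."""
    patches, labels = [], []
    h, w = sharp.shape
    for r in range(0, h - 15, 16):
        for c in range(0, w - 15, 16):
            f = sharp[r:r + 16, c:c + 16]
            u = blurred[r:r + 16, c:c + 16]
            patches.append(np.vstack([f, u])); labels.append(0)  # upper part focused -> 0
            patches.append(np.vstack([u, f])); labels.append(1)  # lower part focused -> 1
    return patches, labels

def build_datasets(gray_img):
    """Original, Gx, and Gy datasets from one sharp image and its four blurred versions."""
    gray_img = gray_img.astype(np.float32)
    datasets = {"original": ([], []), "Gx": ([], []), "Gy": ([], [])}
    for k in BLUR_SIZES:
        blurred = cv2.GaussianBlur(gray_img, (k, k), sigmaX=9)
        views = {
            "original": (gray_img, blurred),
            # Sobel is an assumed choice of gradient operator
            "Gx": (cv2.Sobel(gray_img, cv2.CV_32F, 1, 0), cv2.Sobel(blurred, cv2.CV_32F, 1, 0)),
            "Gy": (cv2.Sobel(gray_img, cv2.CV_32F, 0, 1), cv2.Sobel(blurred, cv2.CV_32F, 0, 1)),
        }
        for name, (sharp, blur) in views.items():
            p, l = stacked_patches(sharp, blur)
            datasets[name][0].extend(p); datasets[name][1].extend(l)
    return datasets
```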

2.3 The proposed network architecture

The schematic diagram of the proposed convolution neural network is shown in Fig. 3. The network contains three simple CNN branches, each trained on a different dataset. Each branch contains three convolution layers with a kernel size of 3 × 3, a stride of 1 × 1, a padding of 1 × 1, and the non-linear ReLU activation function; these layers have 64, 128, and 256 feature maps, respectively. In addition, each branch has one FC layer at the end of the network, and 2 × 2 max pooling is used in the second and third convolution blocks (see Fig. 3). The three branches are trained on the original image, Gx, and Gy datasets, respectively, and each branch has two output neurons corresponding to the two patch labels (0 and 1). The final prediction of the label is then obtained by hard voting on the predictions of the three branches. As the upcoming results show, this contribution helps the network achieve high classification accuracy in multi-focus image fusion. The classification accuracy of the trained network is 99.02% on the 2,000,000 training patches and 98.84% on the 300,000 test patches. If the hard voting is eliminated from the architecture, the accuracy decreases; in other words, the hard voting, which acts as ensemble learning in the proposed architecture, contributes significantly to the classification accuracy in multi-focus image fusion. The accuracies of the trained branches on the test datasets without hard voting are 98.03%, 97.80%, and 97.85% for the first, second, and third CNN branches, respectively, while the accuracy after hard voting is 98.84%. It should be noted that the aim of the proposed network is to create an ideal initial decision map, and these accuracy results only indicate that the training process works well. The initial decision map is much cleaner when the hard voting consults the three CNN branches than when only a single CNN is used, because it is based on three decision maps, one from each branch and dataset.
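A minimal PyTorch sketch of one branch, consistent with the layer sizes given above (3 × 3 kernels, 64/128/256 feature maps, two 2 × 2 max-pooling stages, and an 8192-unit flattened representation feeding a two-class FC layer), is shown below. It is an illustration rather than the authors' exact implementation; the use of batch normalization follows Sect. 2.4, but its placement is an assumption.

```python
import torch
import torch.nn as nn

class BranchCNN(nn.Module):
    """One CNN branch of the proposed hard-voting ensemble (sketch).

    Input: a 1-channel 32x16 patch; output: two class scores.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                      # 32x16 -> 16x8
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                      # 16x8 -> 8x4
        )
        self.classifier = nn.Linear(256 * 8 * 4, 2)   # 8192 -> 2 classes

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

# Three branches, one per dataset (original, Gx, Gy); their predictions are later
# combined with the hard_vote() function sketched in Sect. 2.1.
branches = [BranchCNN() for _ in range(3)]
```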

2.4 Network setting and training platform

The prerequisite for patch classification is the standardization (normalization) of the datasets by subtracting the mean and dividing by the standard deviation of the patches in each dataset. The mean and standard deviation of the 2,000,000 training patches of the original image, Gx, and Gy datasets are µ1 = 0.432, σ1 = 0.097, µ2 = 0.054, σ2 = 0.091, and µ3 = 0.063, σ3 = 0.097, respectively; these values are used to normalize the patches fed to the proposed network. Among the many optimization techniques for updating the model parameters, the proposed network is trained with the stochastic gradient descent (SGD) optimizer, which is the most common choice in deep learning. The multi-focus image fusion task with the proposed patch feeding is easily learnable.

Fig. 2 Sample of a 32 × 16 training block of the created dataset. a Original image. b Gx of image. c Gy of image


Fig. 3 Schematic diagram of the proposed convolution neural network and the flowchart of the proposed HCNN method for obtaining the initial segmented decision map. (Box #1: preparation of the 32 × 16 patches from the source images A and B with the proposed patch feeding strategy, using the original, Gx, and Gy versions of PA and PB. Box #2: the process of creating the score map; each of the three CNN branches has three 3 × 3 convolution layers with 64, 128, and 256 filters, stride 1 × 1, padding 1 × 1, 2 × 2 max pooling, and an 8192-unit FC layer with two outputs, and their predictions are combined by hard voting into the final prediction, the score map, the initial binary segmented map (without post-processing), and the fused image.)


We choose a learning rate of 0.0002, a momentum of 0.9, and a weight decay of 0.0005. A larger learning rate has an undesirable effect on the computed loss, while a lower learning rate leads to more training time and epochs; therefore, a suitable learning rate for the proposed network is 0.0002. We also use a scheduler to adjust the learning rate based on the number of epochs: among the existing schedulers, we choose StepLR with a step size of 1 and a gamma of 0.1, where the step size and gamma are the period and the multiplicative factor of the learning rate decay, respectively. In order to speed up the learning, batch normalization is used in the proposed network; to increase the stability of the CNN, batch normalization normalizes the output of the previous activation in each convolution layer by subtracting the batch mean and dividing by the batch standard deviation. A batch size of 32 is chosen for training the network; lower and higher batch sizes result in an undesirable loss over many iterations, and the network needs fewer epochs when the batch size of 32 is chosen. Because of the simplicity of the proposed network, it has a very fast training procedure and can be easily trained on the training dataset, so we train the proposed network for 20 epochs; in our observations, more epochs do not significantly influence the classification accuracy. The cross entropy loss is used as the criterion of the proposed network, which is suitable for training a two-class classification problem such as multi-focus image fusion. In our observation, small changes in the parameter settings do not significantly influence the accuracy; they only change the time needed to reach the ideal learning point, and the proposed network still yields the best results compared with the previous networks. The proposed network is coded with PyTorch (0.4.0) and trained on Ubuntu Linux 16.04 LTS with a STRIX-GTX1080-O8G GPU and a Core i7 6900k CPU with 32 GB RAM. With these settings and this hardware, training the proposed network took about 5 h.
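The training configuration described in this section can be written as the following PyTorch sketch. The optimizer, scheduler, loss, batch size, epoch count, and the normalization statistics of the original-image branch are taken from the text; the dataset object, loader construction, and loop structure are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Normalization statistics reported for the original-image dataset
MEAN, STD = 0.432, 0.097

def train_branch(model, patches, labels, device="cuda"):
    """Train one CNN branch with the settings reported in Sect. 2.4 (sketch)."""
    x = (torch.as_tensor(patches, dtype=torch.float32).unsqueeze(1) - MEAN) / STD
    y = torch.as_tensor(labels, dtype=torch.long)
    loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.0002,
                                momentum=0.9, weight_decay=0.0005)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

    for epoch in range(20):                       # 20 epochs per the paper
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            optimizer.step()
        scheduler.step()                          # decay the learning rate each epoch
    return model
```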

2.5 The fusion scheme

After training, the weights of the network are saved as a pre-trained network, and it is ready for the fusion process. The input multi-focus images (imA and imB) are fed into the pre-trained network with a stride of two, according to the proposed 32 × 16 patch feeding strategy of Fig. 1 and the schematic diagram of Fig. 3. As in the previous CNN based methods, the input patches of the source images should overlap in order to provide pixel-wise image fusion; the positions of the patches selected from the input images to construct the concatenated 32 × 16 patches (as discussed in the patch feeding strategy) differ from each other by two pixels. There is a small difference between the score map of the proposed method and those of the previous CNN based papers: in our method, each pixel contributes to all of the overlapping patches in its vicinity, so it is involved many times in the voting, with its score being decreased or increased to construct the initial decision map as in (1):

$$M(r, c) = \begin{cases} \text{Decrease}\{M(r{:}r+16,\; c{:}c+16)\} & \text{if label} = 0 \\ \text{Increase}\{M(r{:}r+16,\; c{:}c+16)\} & \text{if label} = 1, \end{cases} \tag{1}$$

where r, c, and M indicate the row and column of the image and the score (decision) map, respectively.

Afterward, the initial segmented decision map D of the proposed method is constructed as in (2), and in the final step the fused image is calculated as in (3):

$$D(r, c) = \begin{cases} 1 & \text{if } M(r, c) > 0 \\ 0 & \text{otherwise}, \end{cases} \tag{2}$$

$$F(r, c) = D(r, c) \times imA(r, c) + (1 - D(r, c)) \times imB(r, c), \tag{3}$$

where imA(r, c) and imB(r, c) are the input multi-focus images.

The process of constructing the initial decision map and the fused image involves only simple mathematics with the pre-trained network. The upcoming results show that the initial segmented decision map of the proposed method is much better than those of the other methods, even after they apply many post-processing algorithms. If post-processing is needed, it can simply be applied to the initial decision map of the proposed method.

In summary, our algorithm can be summarized in the following five steps (a code sketch of these steps is given after the list):

Step 1: Convert the color input multi-focus images into gray-scale images and compute their gradient images in horizontal and vertical directions.

Step 2: Prepare the concatenated 32 × 16 patches according to the proposed patch feeding strategy.

Step 3: Feed the concatenated patches into the proposed network and obtain the focused and unfocused labels.

Step 4: Construct the score map and the initial segmented decision map.

Step 5: Fuse the input multi-focus images with the final decision map.
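The following sketch walks through the five steps and Eqs. (1)–(3) above for two registered grayscale inputs. It assumes a `predict_label(...)` callable that wraps the hard-voting network of Sect. 2.3 and a unit increase/decrease of the score per vote; both the wrapper name and the ±1 step size are illustrative choices, not specified by the paper.

```python
import numpy as np

def fuse(imA, imB, predict_label, stride=2, block=16):
    """Score map, initial decision map, and fused image per Eqs. (1)-(3) (sketch).

    predict_label(pa, pb) must return the hard-voted label (0 or 1) for the
    concatenated 32x16 patch built from the two 16x16 blocks pa and pb.
    """
    h, w = imA.shape
    score = np.zeros((h, w), dtype=np.float32)          # score map M
    for r in range(0, h - block + 1, stride):
        for c in range(0, w - block + 1, stride):
            pa = imA[r:r + block, c:c + block]
            pb = imB[r:r + block, c:c + block]
            label = predict_label(pa, pb)               # Step 3: focused/unfocused label
            # Eq. (1): decrease the block's score for label 0, increase it for label 1
            score[r:r + block, c:c + block] += 1.0 if label == 1 else -1.0
    decision = (score > 0).astype(imA.dtype)            # Eq. (2): initial decision map D
    fused = decision * imA + (1 - decision) * imB       # Eq. (3): fused image F
    return fused, decision, score
```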

2.6 Network complexity assessments

As mentioned in the introduction, deep learning based image fusion methods have greatly enhanced the decision map and the quality of the fused image; hence, recently published papers on image fusion applications are based on deep learning. The time consumed by each step of the proposed method (Sect. 2.5) for two 512 × 512 color multi-focus images is listed in Table 1. As in the other deep learning based methods, the main time consuming step of the proposed method is the processing through the deep network, and the time consumed by the other steps is negligible. Since the platforms of the deep learning based methods usually differ, and most of their source codes are not provided, comparing these methods by running time would be difficult and unfair. Therefore, the fairest way to compare the complexity of deep learning based methods is to count the number of parameters of the main network instead of comparing running times. The numbers of network parameters of the previous state-of-the-art deep learning based multi-focus fusion methods and of the proposed method are listed in Table 2. These parameters must be updated during the training process, and after training the input multi-focus patches are passed through these parameters of the pre-trained network. The number of parameters of the HCNN network is much lower than that of FCN (Guo et al. 2018) and is close to those of CNN (Liu et al. 2017) and MSCNN (Du and Gao 2017). It is essential to note, however, that the proposed method has the least need for extensive post-processing algorithms, which the previous methods strongly require for refining the initial decision map; therefore, the proposed network saves the time that would otherwise be spent on post-processing algorithms for refining the initial segmented decision map. In addition, the further experiments show that the initial segmented decision map of the proposed method is the cleanest and yields the best quality of the fused image among the compared methods.

3 Experimental results and analysis

In this section, the performance of the proposed method and the state-of-the-art methods is assessed with qualitative and quantitative experiments.

3.1 Experimental settings

In order to verify the performance of the proposed method, 13 methods are selected for comparison in this paper. The proposed method is compared with NSCT (Zhang and Guo 2009), DCHWT (Kumar 2013), SDMF (Bavirisetti and Dhuli 2016), MWGF (Zhou et al. 2014), CBF (Kumar 2015), DSIFT (Liu et al. 2015), SSDI (Guo et al. 2015), BFMM (Zhang et al. 2017), GFF (Li et al. 2013a), IMF (Li et al. 2013b), CNN (Liu et al. 2017), MSCNN (Du and Gao 2017), and FCN (Guo et al. 2018). The source codes of MWGF, NSCT, SDMF, CBF, DCHWT, DSIFT, SSDI, GFF, IMF, BFMM, and CNN are provided by their authors and are available on the websites of (Liu; Kang; Wang; Zhiqiang; Zhang; Kumar; Guo; Bavirisetti). The source codes of MSCNN and FCN are not provided, so the reported results and the real test multi-focus images of the MSCNN and FCN papers are used for the quantitative comparison with the other methods under fair conditions. We used 28 pairs of non-referenced multi-focus images for comparing the proposed method with the others: the 20 pairs of color multi-focus images of the Lytro dataset are available from (Nejati), and the other eight pairs of multi-focus images can be found on the websites of (Liu; Kang; Wang; Zhiqiang; Zhang; Kumar; Guo; Bavirisetti). For the comparisons, we used 19 fusion metrics, most of which are taken from the websites of (Kumar; Liu); the rest were coded by us or obtained directly from the corresponding authors of the previous papers.

3.2 Objective quality metrics

Since the assessment of the fused image for non-referenced multi-focus images is very difficult, the most reliable way to compare the fused images with each other is qualitative assessment.

Table 1 Time consumed for each step of the proposed method on two 512 × 512 color multi-focus images

Step       Step 1    Step 2    Step 3     Step 4    Step 5
Time (s)   0.2278    0.5288    56.8952    0.0790    0.5899

Table 2 The number of network parameters of the multi-focus fusion methods based on deep learning

Method                        CNN (Liu et al. 2017)   MSCNN (Du and Gao 2017)   FCN (Guo et al. 2018)   HCNN (proposed)
Total number of parameters    25,344                  17,152                    267,506                 38,016
Normalized complexity         ~ 0.66                  ~ 0.45                    ~ 7                     1


Therefore, introducing quantitative fusion metrics whose results closely agree with qualitative comparison is a challenge for image fusion research. There are many non-referenced image fusion metrics that evaluate the quality of non-referenced multi-focus image fusion. Each of them has some advantages and disadvantages, and may not always evaluate the quality of the output images properly. A fusion metric performs better when it is consistent with human visual evaluation and can therefore be used to evaluate image fusion. In this part, some well-known non-referenced fusion metrics are introduced with their mathematical formulas.

(1) API: the average pixel intensity (API) measures the mean intensity or contrast of the fused image as
$$API = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} f(i,j)}{m \times n},$$
where f(i, j), m, and n are the intensity and the sizes of the fused image (Kumar 2013, 2015).

(2) SD: the standard deviation (SD) of the fused image measures the spread of the image intensity with
$$SD = \sqrt{\frac{\sum_{i=1}^{m}\sum_{j=1}^{n} \big(f(i,j) - API\big)^2}{m \times n}}$$
(Kumar 2013, 2015).

(3) AG: the average gradient (AG) measures the amount of clarity and sharpness of the fused image with
$$AG = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} \big((f(i,j)-f(i+1,j))^2 + (f(i,j)-f(i,j+1))^2\big)^{1/2}}{m \times n}$$
(Kumar 2013, 2015).

(4) H: the entropy (H) estimates the amount of information present in the fused image with
$$H = -\sum_{k=0}^{255} p_F(k) \log_2\big(p_F(k)\big),$$
where p_F(k) stands for the probability of intensity value k in image F (Kumar 2013, 2015).

(5) SF: the spatial frequency (SF) measures the overall activity level and sharpness in the regions of the fused image by
$$SF = \sqrt{RF^2 + CF^2}, \qquad RF = \sqrt{\frac{\sum_i \sum_j \big(f(i,j)-f(i,j-1)\big)^2}{m \times n}}, \qquad CF = \sqrt{\frac{\sum_i \sum_j \big(f(i,j)-f(i-1,j)\big)^2}{m \times n}}$$
(Kumar 2013, 2015).

(6) MI: the mutual information (MI) measures the overall mutual information between the source images and the fused image by MI = MI_{AF} + MI_{BF} (Kumar 2013, 2015). The mutual information between the fused image and the source images A and B is calculated as
$$MI_{AF} = \sum_i \sum_j P_{A,F}(i,j) \log_2\!\left(\frac{P_{A,F}(i,j)}{P_A(i)\, P_F(j)}\right), \qquad MI_{BF} = \sum_i \sum_j P_{B,F}(i,j) \log_2\!\left(\frac{P_{B,F}(i,j)}{P_B(i)\, P_F(j)}\right),$$
respectively.

(7) FS: the fusion symmetry (FS) indicates how symmetrical the fused image is with respect to the source images, and is given by
$$FS = 2 - \left|\frac{MI_{AF}}{MI} - 0.5\right|$$
(Kumar 2013, 2015).

(8) CC: the correlation coefficient (CC) measures the relevance of the fused image to the source images, and is given by CC = (r_{AF} + r_{BF})/2. The relevance of the fused image to the source images A and B is calculated as
$$r_{AF} = \frac{\sum_i \sum_j (A(i,j)-\bar{A})(F(i,j)-\bar{F})}{\sqrt{\big(\sum_i \sum_j (A(i,j)-\bar{A})^2\big)\big(\sum_i \sum_j (F(i,j)-\bar{F})^2\big)}}, \qquad r_{BF} = \frac{\sum_i \sum_j (B(i,j)-\bar{B})(F(i,j)-\bar{F})}{\sqrt{\big(\sum_i \sum_j (B(i,j)-\bar{B})^2\big)\big(\sum_i \sum_j (F(i,j)-\bar{F})^2\big)}},$$
respectively (Kumar 2013, 2015).

(9) Q^{AB/F}: the total gradient information transferred from the source images to the fused image (Q^{AB/F}) is calculated by
$$Q^{AB/F} = \frac{\sum_{i,j} \big(Q^{AF}_{i,j}\, w^A_{i,j} + Q^{BF}_{i,j}\, w^B_{i,j}\big)}{\sum_{i,j} \big(w^A_{i,j} + w^B_{i,j}\big)},$$
where Q^{AF}_{i,j} and Q^{BF}_{i,j} are weighted by w^A_{i,j} and w^B_{i,j}, respectively, and a constant L is used in the weights $w^A_{i,j} = [g^A_{i,j}]^L$ and $w^B_{i,j} = [g^B_{i,j}]^L$. The gradient information transferred from images A and B to the fused image is given by $Q^{AF}(i,j) = Q^{AF}_g(i,j)\, Q^{AF}_{\alpha}(i,j)$ and $Q^{BF}(i,j) = Q^{BF}_g(i,j)\, Q^{BF}_{\alpha}(i,j)$, respectively, where Q_g and Q_α are the edge strength and orientation preservation values. In the procedure of computing Q_g and Q_α, there are constants such as Γ_g, K_g, σ_g, Γ_α, K_α, and σ_α that determine the exact shape of the sigmoid functions used to form the edge strength and orientation preservation values; these values differ among some codes and papers (Petrovic and Xydeas 2005; Xydeas and Petrovic 2000).

(10) L^{AB/F}: the total gradient information lost during the fusion process from the source images to the fused image (L^{AB/F}) is given by
$$L^{AB/F} = \frac{\sum_{i,j} r_{i,j}\big[(1-Q^{AF}_{i,j})\, w^A_{i,j} + (1-Q^{BF}_{i,j})\, w^B_{i,j}\big]}{\sum_{i,j} \big(w^A_{i,j} + w^B_{i,j}\big)}, \qquad r_{i,j} = \begin{cases} 1 & \text{if } g^F_{i,j} < g^A_{i,j} \text{ or } g^F_{i,j} < g^B_{i,j} \\ 0 & \text{otherwise} \end{cases}$$
(Petrovic and Xydeas 2005; Xydeas and Petrovic 2000).

(11) N^{AB/F}: the total fusion artifacts or noise added to the fused image by the fusion process (N^{AB/F}), which is not related to the source images, is given by
$$N^{AB/F} = \frac{\sum_{i,j} N_{i,j}\big(w^A_{i,j} + w^B_{i,j}\big)}{\sum_{i,j} \big(w^A_{i,j} + w^B_{i,j}\big)}, \qquad N_{i,j} = \begin{cases} 2 - Q^{AF}_{i,j} - Q^{BF}_{i,j} & \text{if } g^F_{i,j} > g^A_{i,j} \text{ and } g^F_{i,j} > g^B_{i,j} \\ 0 & \text{otherwise} \end{cases}$$
(Petrovic and Xydeas 2005; Xydeas and Petrovic 2000). Recently, Kumar revised N^{AB/F} to N^{AB/F}_m such that Q^{AB/F} + L^{AB/F} + N^{AB/F}_m = 1. In Kumar (2013, 2015), N^{AB/F}_m is given by
$$N^{AB/F}_m = \frac{\sum_{i,j} AM_{i,j}\big[(1-Q^{AF}_{i,j})\, w^A_{i,j} + (1-Q^{BF}_{i,j})\, w^B_{i,j}\big]}{\sum_{i,j} \big(w^A_{i,j} + w^B_{i,j}\big)}, \qquad AM_{i,j} = \begin{cases} 1 & \text{if } g^F_{i,j} > g^A_{i,j} \text{ and } g^F_{i,j} > g^B_{i,j} \\ 0 & \text{otherwise.} \end{cases}$$
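To make the statistics-based metrics above concrete, the following NumPy sketch computes API, SD, AG, H, and SF for a grayscale fused image as defined in (1)–(5); it is an illustrative implementation written for this text, not the evaluation code used in the experiments.

```python
import numpy as np

def simple_metrics(fused):
    """API, SD, AG, entropy H, and spatial frequency SF of a grayscale image."""
    f = fused.astype(np.float64)
    m, n = f.shape

    api = f.mean()                                        # (1) average pixel intensity
    sd = np.sqrt(((f - api) ** 2).mean())                 # (2) standard deviation

    dx = f[:-1, :] - f[1:, :]                             # f(i,j) - f(i+1,j)
    dy = f[:, :-1] - f[:, 1:]                             # f(i,j) - f(i,j+1)
    ag = np.sqrt(dx[:, :-1] ** 2 + dy[:-1, :] ** 2).sum() / (m * n)   # (3) average gradient

    hist, _ = np.histogram(fused, bins=256, range=(0, 256))
    p = hist / hist.sum()
    h = -np.sum(p[p > 0] * np.log2(p[p > 0]))             # (4) entropy

    rf = np.sqrt(((f[:, 1:] - f[:, :-1]) ** 2).sum() / (m * n))       # row frequency
    cf = np.sqrt(((f[1:, :] - f[:-1, :]) ** 2).sum() / (m * n))       # column frequency
    sf = np.sqrt(rf ** 2 + cf ** 2)                       # (5) spatial frequency

    return {"API": api, "SD": sd, "AG": ag, "H": h, "SF": sf}
```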

(12) Q_MI: Hossny et al. (2008) normalized the mutual information measure in order to give a correct estimation of the information transferred from the source images into the fused image. This metric is computed by
$$Q_{MI} = 2\left[\frac{I(F,X)}{H(F)+H(X)} + \frac{I(F,Y)}{H(F)+H(Y)}\right].$$

(13) Q_Y (SSIM): Yang et al. (2008) introduced the fusion metric Q_Y (SSIM), which is based on the structural similarity (SSIM) in local windows. Q_Y (SSIM) is calculated by averaging over all windows of the whole image:
$$Q_Y(x, y, f) = \frac{1}{|W|}\sum_{w \in W} Q(x, y, f\,|\,w),$$
$$Q(x, y, f\,|\,w) = \begin{cases} \lambda(w)\,SSIM(x, f\,|\,w) + (1-\lambda(w))\,SSIM(y, f\,|\,w) & \text{for } SSIM(x, y\,|\,w) \ge 0.75 \\ \max\{SSIM(x, f\,|\,w),\; SSIM(y, f\,|\,w)\} & \text{for } SSIM(x, y\,|\,w) < 0.75, \end{cases}$$
where $\lambda(w) = \frac{S(x|w)}{S(x|w)+S(y|w)}$ is the local weight, and S(x|w) and S(y|w) are the variances of w_x and w_y, respectively.

(14) Q_NICE: the nonlinear correlation information entropy NICE_R (Q_NICE) is used as a nonlinear correlation measure of the related variables, and is given by
$$NICE_R = 1 - H_R = 1 + \sum_{i=1}^{k} \frac{\lambda^R_i}{k}\log\frac{\lambda^R_i}{k},$$
where $\lambda^R_i$, i = 1, ..., k, are the eigenvalues of the nonlinear correlation matrix (Wang et al. 2008).

(15) Q_TE: the metric Q_TE (Cvejic et al. 2006) is a modification of MI based on the Tsallis entropy, which is a one-parameter generalization of the Shannon entropy and a form of non-extensive entropy. The fusion metric Q_TE is computed by
$$M^{\alpha}_{FXY} = I^{\alpha}_{FX}(f, x) + I^{\alpha}_{FY}(f, y),$$
where
$$I^{\alpha}_{FX}(f, x) = \frac{1}{1-\alpha}\left(1 - \sum_{F,X}\frac{\big(P_{FX}(f,x)\big)^{\alpha}}{\big(P_F(f)\, P_X(x)\big)^{1-\alpha}}\right), \qquad I^{\alpha}_{FY}(f, y) = \frac{1}{1-\alpha}\left(1 - \sum_{F,Y}\frac{\big(P_{FY}(f,y)\big)^{\alpha}}{\big(P_F(f)\, P_Y(y)\big)^{1-\alpha}}\right).$$

(16) Q_EP: Wang and Liu (2008) proposed an image fusion metric from an edge information preservation perspective. They implemented a multiscale analysis and calculated the edge preservation value of the fused image scale by scale. The overall quality metric Q_EP is derived by combining the metrics at different scales via
$$Q_{EP} = \prod_{i=1}^{N} \big(Q^{AB/F}_i\big)^{\alpha_i},$$
where α_i adjusts the relative importance of the different scales. The per-scale metric is computed by
$$Q^{AB/F}_i = \frac{\sum_m \sum_n \big(EP^{AF}_i(m,n)\, w^A_i(m,n) + EP^{BF}_i(m,n)\, w^B_i(m,n)\big)}{\sum_m \sum_n \big(w^A_i(m,n) + w^B_i(m,n)\big)},$$
where the weights are computed as $w^A_i(m,n) = LH^2_{A_i}(m,n) + HL^2_{A_i}(m,n) + HH^2_{A_i}(m,n)$ (and similarly for $w^B_i$). The edge preservation value at scale i is derived by
$$EP^{AF}_i(m,n) = \frac{\beta^{AF}_{H_i}(m,n) + \beta^{AF}_{V_i}(m,n) + \beta^{AF}_{D_i}(m,n)}{3},$$
where $\beta^{AF}_{H_i}(m,n) = \exp(-|LH_{A_i} - LH_{F_i}|)$, $\beta^{AF}_{V_i}(m,n) = \exp(-|HL_{A_i} - HL_{F_i}|)$, and $\beta^{AF}_{D_i}(m,n) = \exp(-|HH_{A_i} - HH_{F_i}|)$.

(17) rSFe (Q_SF): the ratio of spatial frequency error (rSFe) is derived from the definition of the spatial frequency (SF) metric, which reflects the local intensity variation. The criterion is given by
$$rSFe = \frac{SF_F - SF_R}{SF_R}, \qquad SF = \sqrt{RF^2 + CF^2 + MDF^2 + SDF^2},$$
where RF, CF, MDF, and SDF are the four directional spatial frequencies; the procedure for computing these parameters is given in Zheng et al. (2007). For this criterion, an ideal fusion has rSFe = 0, and the smaller the absolute value of rSFe, the better the fused image. Besides, rSFe > 0 indicates an over-fused image into which some distortion or noise was introduced, while rSFe < 0 denotes an under-fused image in which some meaningful information is lost.

(18) VIF: visual information fidelity (VIF) is an information fidelity criterion that quantifies the Shannon information shared between the source images and the distorted images relative to the information contained in the source image itself (Sheikh and Bovik 2006), and it is given by
$$VIF = \frac{\sum_{j \in \text{subbands}} I\big(\vec{C}^{N,j}; \vec{F}^{N,j} \,|\, S^{N,j}\big)}{\sum_{j \in \text{subbands}} I\big(\vec{C}^{N,j}; \vec{E}^{N,j} \,|\, S^{N,j}\big)},$$
where
$$I\big(\vec{C}^{N}; \vec{E}^{N} \,|\, S^{N}\big) = \frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{M}\log_2\!\left(1 + \frac{s_i^2\, \lambda_k}{\sigma_n^2}\right), \qquad I\big(\vec{C}^{N}; \vec{F}^{N} \,|\, S^{N}\big) = \frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{M}\log_2\!\left(1 + \frac{g^2\, s_i^2\, \lambda_k}{\sigma_v^2 + \sigma_n^2}\right).$$
The procedure for computing the other parameters is given in Sheikh and Bovik (2006).

(19) Q_CB: Chen and Blum (2009) proposed a perceptual quality evaluation method for image fusion that is based on human visual system (HVS) models. In this metric, the source and fused images are filtered by a contrast sensitivity function (CSF) after computing a local contrast map for each image. Then, a contrast preservation map is generated to describe the relationship between the fused image and each source image. Finally, the preservation maps are weighted by a saliency map to obtain an overall quality map, and the mean of the quality map indicates the quality of the fused image. The metric is given by
$$Q_{CB} = \overline{Q_C(x, y)}, \qquad Q_C(x, y) = \lambda_A(x, y)\, Q_{AF}(x, y) + \lambda_B(x, y)\, Q_{BF}(x, y),$$
where λ_A and λ_B are the saliency maps of the input images A and B, and Q_{AF} and Q_{BF} are the information preservation values between the fused image and the input images A and B, respectively. The full procedure for computing these parameters is given in Chen and Blum (2009).
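As a worked example of the information-theoretic metrics, the sketch below estimates MI (metric 6) and its normalized variant Q_MI (metric 12) from joint histograms. It is illustrative code written for this text under the stated definitions, not the evaluation scripts used for Table 3.

```python
import numpy as np

def _entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def _mutual_info(a, f, bins=256):
    """Mutual information between two grayscale images from their joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), f.ravel(), bins=bins,
                                 range=[[0, 256], [0, 256]])
    p_af = joint / joint.sum()
    p_a = p_af.sum(axis=1)
    p_f = p_af.sum(axis=0)
    return _entropy(p_a) + _entropy(p_f) - _entropy(p_af.ravel())

def mi_metric(img_a, img_b, fused):
    """MI = MI_AF + MI_BF (metric 6)."""
    return _mutual_info(img_a, fused) + _mutual_info(img_b, fused)

def qmi_metric(img_a, img_b, fused, bins=256):
    """Normalized mutual information Q_MI of Hossny et al. (metric 12)."""
    def marg_entropy(img):
        hist, _ = np.histogram(img, bins=bins, range=(0, 256))
        return _entropy(hist / hist.sum())
    h_a, h_b, h_f = marg_entropy(img_a), marg_entropy(img_b), marg_entropy(fused)
    return 2 * (_mutual_info(img_a, fused) / (h_f + h_a)
                + _mutual_info(img_b, fused) / (h_f + h_b))
```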

3.3 Quantitative and qualitative evaluation

The assessment of the results with quantitative metrics is hard and not fully reliable because the ideal (ground-truth) images of real multi-focus images are not available; therefore, the better assessment is the visual, qualitative comparison. The initial segmented decision map of a method is the first output of its main algorithm, and the final decision map is achieved after applying many post-processing algorithms such as consistency verification (CV), guided filtering, small region removal, watershed, and morphological filtering (closing and opening), according to the respective methods (Zhou et al. 2014; Liu et al. 2015, 2017; Guo et al. 2015, 2018; Zhang et al. 2017; Li et al. 2013a; Du and Gao 2017; Tang et al. 2018). A large share of the good performance of the final decision maps is due to these extensive post-processing algorithms, which is a separate issue from the main algorithm or CNN network. Therefore, the better method is the one whose initial segmented decision map best matches the focused and unfocused regions of the input multi-focus images. We provide four phases of quantitative and qualitative evaluation to prove the superiority of all phases of the proposed method over the previous methods.


The first phase of the experiments

In the first phase of the experiments, the proposed HCNN method is compared with MWGF (Zhou et al. 2014), DSIFT (Liu et al. 2015), SSDI (Guo et al. 2015), CNN (Liu et al. 2017), and MSCNN (Du and Gao 2017). The visual comparison of the initial and final segmented decision maps of the proposed method with those of the others is shown in Figs. 4 and 5. The two source images of the color multi-focus set “Flower” are shown in Fig. 4a, b. In addition, the difference images between the source multi-focus images and the output fused images of the methods are presented in Fig. 4. The final decision map of MWGF is depicted in Fig. 4c, which shows jagged and ringing artifacts in the edge areas of the decision map. The difference images between the fused image of MWGF and the source images are depicted in Fig. 4m, n; the undesirable side effects visible in these difference images indicate that this method does not have adequate quality. The initial and final decision maps of SSDI are depicted in Fig. 4d, e, respectively; these decision maps are disheveled and have incorrect holes. The difference images between the fused image of SSDI and the source images are shown in Fig. 4o, p, where the weakness of this method is clear. The initial and final decision maps of DSIFT are shown in Fig. 4f, g, respectively. Its decision maps, with or without post-processing, have jagged artifacts and do not best match the focused areas of the source images. The difference images between the fused image of DSIFT and the source images are shown in Fig. 4q, r; the weakness of this method can be seen in the difference images, where pixels in the “Flower” region of the source images are mistakenly selected. In the remaining stages of this experiment, the state-of-the-art convolution neural network based methods CNN and MSCNN are compared with the proposed method. The initial segmented decision map and the final decision map of CNN are shown in Fig. 4h, i, respectively, and those of MSCNN are shown in Fig. 4j, k, respectively. As mentioned, since the source code of MSCNN is not available, the initial and final segmented decision maps of MSCNN are taken from its published paper (Du and Gao 2017). The initial segmented decision map of the proposed method (HCNN) is shown in Fig. 4l, and it is obtained without any post-processing.

Fig. 4 The source multi-focus images of “Flower”, the initial segmented decision maps and the final decision maps of the methods, and the difference images between the fused images and the source images. a The first source image. b The second source image. c MWGF. d SSDI without post-processing. e SSDI with post-processing. f DSIFT without post-processing. g DSIFT with post-processing. h CNN without post-processing. i CNN with post-processing. j MSCNN without post-processing. k MSCNN with post-processing. l The initial segmented map of the proposed method HCNN without post-processing. m, n The difference images of MWGF. o, p The difference images of SSDI. q, r The difference images of DSIFT. s, t The difference images of CNN. u, v The difference images of the proposed method HCNN. w The fused image using the proposed method HCNN


By comparing the initial decision map of HCNN with those of CNN and MSCNN, it is obvious that the best initial decision map, without any error and with the best matching to the focused regions of the input images, is obtained by our proposed HCNN. In addition, the initial decision map of the proposed method is still better than the final decision maps of CNN and MSCNN, which are obtained after applying many post-processing steps. The difference images of CNN are shown in Fig. 4s, t; the poor quality of the fused image of this method can be understood from these difference images. The difference images of HCNN are shown in Fig. 4u, v; they are very accurate and show the best matching compared with the difference images of CNN. By comparing the difference images of HCNN with the input multi-focus images, it is obvious that there is no error or mistakenly selected pixel. The fused image according to the initial segmented decision map of HCNN is shown in Fig. 4w; it has been created without any post-processing, and its superior quality is clearly visible.

In another visual assessment, the initial segmented decision map of HCNN for the multi-focus images of “Leopard” is compared with the initial and final segmented decision maps of MWGF, SSDI, DSIFT, and CNN in Fig. 5. The two source gray-scale multi-focus images of “Leopard” are shown in Fig. 5a, b. The final decision map of MWGF is shown in Fig. 5c, which has jagged and ringing artifacts. The initial and final segmented decision maps of SSDI, DSIFT, and CNN are shown in Fig. 5d–i, respectively, and the initial segmented decision map of HCNN is shown in Fig. 5j. The initial segmented decision maps of the SSDI, DSIFT, and CNN methods are not acceptable, whereas the initial segmented decision map of HCNN is very satisfactory. In addition, the initial segmented decision map of HCNN is still better than the final maps of the others, which are the results of applying many post-processing algorithms. The difference images of HCNN, shown in Fig. 5l, m, indicate the good accuracy of the proposed method in this part of the experiment.

Overall, these state-of-the-art methods (MWGF, DSIFT, SSDI, CNN, and MSCNN) have produced the best results among the multi-focus image fusion methods introduced in recent years. However, their initial segmented decision maps are still undesirable and unacceptable considering the focused areas of the source images, and their final decision maps are still unsatisfactory even after applying many post-processing algorithms to the initial decision maps. By visual observation, we demonstrated that the initial segmented decision map of the proposed method (HCNN), which does not use any post-processing algorithm, is much better than those of the others (with or without applying many post-processing algorithms).

Fig. 5 The initial and final segmented decision maps of the proposed method and the others for the “Leopard” multi-focus images. a The first source image, b the second source image, c MWGF, d SSDI without post-processing, e SSDI with post-processing, f DSIFT without post-processing, g DSIFT with post-processing, h CNN without post-processing, i CNN with post-processing, j the initial segmented map of HCNN without post-processing (the proposed method), k the fused image using HCNN (the proposed method), l, m the difference images of HCNN (the proposed method)

As mentioned, since the ground-truth of real multi-focus images is not available, the best assessments are visual and qualitative assessments, and quantitative assessments are not always reliable. Nevertheless, we compare the proposed method with the others in Table 3 using objective quality fusion metrics and the results reported in Du and Gao (2017). These experiments were performed on seven of the most commonly used color and gray-scale multi-focus image pairs. These source multi-focus images and the images fused by the proposed HCNN method are shown in Fig. 6; five image pairs are in color and two are in gray-scale, with various sizes. As expected, the results of the proposed method in the quantitative assessment of Table 3 are overall better than those of the others in terms of MI, QAB/F, and QY (SSIM). The values of QY (SSIM) for HCNN are better than those of the other methods for all test images, and the values of MI and QAB/F for HCNN are better for most of the test multi-focus images.

The second phase of the experiment

In the second phase of the experiments, our proposed HCNN method is compared with SDMF (Bavirisetti and Dhuli 2016), MWGF (Zhou et al. 2014), CBF (Kumar 2015), BFMM (Zhang et al. 2017), GFF (Li et al. 2013a), IMF (Li et al. 2013b), and CNN (Liu et al. 2017). The visual comparison of the fused images and the difference images between the fused images of the mentioned methods and the source images is shown in Figs. 7 and 8. The two source images of the "Balloon" multi-focus pair are shown in Fig. 7a, b. The fused image and the difference images of MWGF are shown in Fig. 7c–e; a wide range of areas mistaken with respect to the focused areas of the source images can be seen in its difference images. The fused image and the difference images of GFF are shown in Fig. 7f–h; in its difference images, some regions are selected incorrectly from the source images. The fused image and the difference images of BFMM are shown in Fig. 7i–k; its difference images contain only a small mistaken area. The fused image and the difference images of IMF are shown in Fig. 7l–n; large errors and mistaken areas are easily seen in its difference images, which leads to a low-quality fused image. The fused image and the difference images of SDMF are shown in Fig. 7o–q. This method has a terrible difference image, in which the pixels of the fused image have no logical relation to the intensities of the source images. The fused image of SDMF may have good contrast, but this method greatly over-fuses the images.

Table 3 Comparison of objective quality fusion metrics of the proposed multi-focus image fusion method and the other methods (a results from Du and Gao 2017). Methods: MWGFa (Zhou et al. 2014), SSDIa (Guo et al. 2015), DSIFTa (Liu et al. 2015), CNNa (Liu et al. 2017), MSCNNa (Du and Gao 2017), HCNN (proposed)

Test image  Metric      MWGFa    SSDIa    DSIFTa   CNNa     MSCNNa   HCNN
Lab         MI          8.0618   8.1412   8.2501   8.6008   8.8044   8.8521
            QAB/F       0.7147   0.7528   0.7585   0.7573   0.7588   0.7588
            QY (SSIM)   0.8746   0.8823   0.9132   0.8947   0.9148   0.9712
Temple      MI          5.9655   7.0896   7.3514   6.8895   7.4177   7.3669
            QAB/F       0.7501   0.7634   0.7643   0.7590   0.7623   0.7673
            QY (SSIM)   0.8992   0.9125   0.9138   0.9063   0.9251   0.9943
Seascape    MI          7.1404   7.4824   7.9487   7.6285   8.0214   8.1035
            QAB/F       0.7059   0.7110   0.7126   0.7113   0.7122   0.7169
            QY (SSIM)   0.9366   0.9473   0.9452   0.9481   0.9547   0.9922
Book        MI          8.2368   8.4008   8.6623   8.7796   8.8947   8.9210
            QAB/F       0.7240   0.7260   0.7134   0.7277   0.7284   0.7260
            QY (SSIM)   0.9120   0.9221   0.9045   0.9374   0.9473   0.9730
Leopard     MI          9.9474   10.8887  10.9226  10.8792  10.9420  10.9371
            QAB/F       0.8175   0.8171   0.8069   0.7973   0.8267   0.8275
            QY (SSIM)   0.9435   0.9325   0.9572   0.9218   0.9748   0.9934
Children    MI          8.2622   7.8505   8.5252   8.3338   8.5363   8.4289
            QAB/F       0.6741   0.6799   0.7394   0.7408   0.7384   0.7446
            QY (SSIM)   0.8675   0.8752   0.9255   0.9263   0.9341   0.9877
Flower      MI          8.3255   8.1049   8.5365   8.2695   8.6125   8.5466
            QAB/F       0.6913   0.6490   0.7159   0.7183   0.7157   0.7182
            QY (SSIM)   0.9460   0.9207   0.9479   0.9566   0.9689   0.9833


Large errors and mistaken areas can easily be seen in the difference images of SDMF, indicating a low quality of the fused image. The fused image and the difference images of CBF are shown in Fig. 7r–t; large, unsuitably selected areas and pixels of the fused image are easily seen in its difference images. The fused image and the difference images of CNN are shown in Fig. 7u–w; this method shows very little error in its difference images with respect to the focused areas of the source images. The fused image and the difference images of our proposed HCNN method are shown in Fig. 7x–z. Among all the methods, the difference images of our proposed method have the cleanest areas and the best match to the focused areas of the source images.

In a similar qualitative evaluation in this phase of the experiments, the source multi-focus images of "Coca-Cola", the fused images, and the difference images of the methods are shown in Fig. 8. As expected, the difference images of our proposed HCNN method are the cleanest and show the best matching and quality among all the methods.

In the quantitative evaluation of the second phase of the experiments, our proposed method is compared with the other methods using 18 fusion metrics in Table 4. The fusion metrics API, SD, AG, and H are not reliable for assessing state-of-the-art multi-focus image fusion methods, since they only measure the fused image and do not consider information from the source images; in other words, they do not measure the relevance between the fused image and the source images. For example, the highest values of API, SD, SF, and AG belong to SDMF, yet the qualitative evaluations showed that this method over-fuses the images and produces terrible difference images. Likewise, the highest value of the entropy metric H belongs to CBF, although this method does not show comparable quality or a clean difference image. The QAB/F, LAB/F, and NAB/F metrics differ across implementations in coding and parameter settings; the QAB/F, LAB/F, and NAB/F values in Table 4 follow the papers of Kumar (2013, 2015). Overall, the better scores for these metrics belong to CNN and HCNN.
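To make the distinction concrete, the sketch below computes these no-reference statistics from the fused image only, so they cannot penalise a method that injects content absent from the sources. The definitions follow common textbook formulations and may differ in small details (e.g. gradient normalisation) from the code behind Table 4.

import numpy as np

def no_reference_stats(fused):
    """API, SD, AG, H, and SF computed from the fused image alone."""
    f = fused.astype(np.float64)
    api = f.mean()                                   # average pixel intensity
    sd = f.std()                                     # standard deviation
    gx = np.diff(f, axis=1)                          # horizontal differences
    gy = np.diff(f, axis=0)                          # vertical differences
    # average gradient over the common (H-1) x (W-1) region
    ag = np.mean(np.sqrt((gx[:-1, :] ** 2 + gy[:, :-1] ** 2) / 2.0))
    hist, _ = np.histogram(f, bins=256, range=(0, 256))
    p = hist / hist.sum()
    h = -np.sum(p[p > 0] * np.log2(p[p > 0]))        # entropy
    sf = np.sqrt(np.mean(gx ** 2) + np.mean(gy ** 2))  # spatial frequency
    return api, sd, ag, h, sf

Because no source image appears in these formulas, an over-fused or over-sharpened result can score highly on them while failing the reference-based metrics.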

Fig. 6 The other multi-focus images and the final fused images of the proposed method, which are not shown in the visual comparisons of the first phase of the experiments. a The first source image, b the second source image, f the fused image of the proposed HCNN method


For a more accurate and fair comparison, we also conducted quantitative assessments with the well-known and more informative fusion metrics QMI, QTE, QEP, QNICE, rSFe, QY (SSIM), and QCB in Table 4. Overall, the highest scores for these metrics belong to HCNN, followed by CNN.

The third phase of the experiment

In the third phase of the experiments, the proposed HCNN method is compared with NSCT (Zhang and Guo 2009), DCHWT (Kumar 2013), MWGF (Zhou et al. 2014), CBF (Kumar 2015), DSIFT (Liu et al. 2015), BFMM (Zhang et al. 2017), GFF (Li et al. 2013a), IMF (Li et al. 2013b), CNN (Liu et al. 2017), and FCN (Guo et al. 2018). In this experiment, we used 20 pairs of color multi-focus images from the Lytro dataset (Nejati). The initial segmentation maps, obtained without applying any post-processing algorithms, of our proposed HCNN method and the other methods for the "Children" multi-focus images are shown in Fig. 9. The two source images of "Children" are shown in Fig. 9a, b. The initial decision maps of MWGF, IMF, GFF, BFMM, SSDI, DSIFT, CNN, FCN, and our proposed HCNN method are shown in Fig. 9c–k, respectively. The initial decision map of MWGF has ringing artifacts at the edges. The initial decision map of IMF is very ambiguous and unacceptable. The initial decision map of GFF has large mistaken areas with respect to the focused areas of the source images. The initial decision map of BFMM has jagged artifacts and mistaken areas. The initial decision map of SSDI shows a very unsuitable focus-map selection. The initial decision map of DSIFT contains areas selected incorrectly with respect to the focused regions of the source images. The initial decision map of CNN has some small mistaken holes. The initial decision map of FCN is very different

Fig. 7 The source multi-focus images of "Balloon", the fused images, and the difference images between the fused images and the source images for the multi-focus image fusion methods. a The first source image, b the second source image, c MWGF, d, e the difference images of MWGF, f GFF, g, h the difference images of GFF, i BFMM, j, k the difference images of BFMM, l IMF, m, n the difference images of IMF, o SDMF, p, q the difference images of SDMF, r CBF, s, t the difference images of CBF, u CNN, v, w the difference images of CNN, x the proposed method of HCNN, y, z the difference images of the proposed method of HCNN


from the others; it needs extra post-processing to refine the decision map and, in other words, offers no advantage over the others. Therefore, the initial decision maps of all of the mentioned methods require extensive post-processing algorithms for refinement. The initial decision map of our proposed HCNN method, which is obtained without applying any post-processing, is the cleanest and is the least dependent on post-processing steps. The source images of the Lytro dataset, the final decision maps, and the fused images of our proposed HCNN method are shown in Fig. 10.
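For clarity, the step that turns any of these decision maps into a fused image is a simple per-pixel selection; a minimal sketch is given below, assuming a binary map with 1 marking pixels judged in focus in the first source. This is the generic composition step shared by decision-map methods, not the exact code of any particular paper.

import numpy as np

def fuse_with_decision_map(src_a, src_b, decision_map):
    """Compose the fused image from an (initial or refined) binary decision map.

    decision_map is 1 where source A is judged in focus and 0 where source B is.
    """
    d = decision_map.astype(np.float64)
    if src_a.ndim == 3:                  # replicate the map across colour channels
        d = d[..., np.newaxis]
    fused = d * src_a.astype(np.float64) + (1.0 - d) * src_b.astype(np.float64)
    return np.clip(fused, 0, 255).astype(np.uint8)

Because the map selects pixels directly from the sources, any error in the decision map appears verbatim in the fused image, which is why a clean initial map matters so much.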

We compared our proposed method with the other methods in Table 5 using the quality fusion metrics MI, QAB/F, and VIF for quantitative assessment. Since the source code of the method of Guo et al. (2018) was not provided, the reported results are used in Table 5. The average scores of these metrics are listed in Table 5. As can be seen, our proposed method outperforms the others with respect to MI, QAB/F, and VIF.

The fourth phase of the experiment

In the fourth phase of the experiments, we demonstrate that our proposed HCNN method can easily be extended to fuse more than two input multi-focus images. Some previous papers handle only two input images, and others consider two images A and B for simplicity when introducing a fusion scheme (Liu et al. 2015, 2017; Guo et al. 2015, 2018; Zhang et al. 2017; Du and Gao 2017; Tang et al. 2018; Xu et al. 2018). Since multi-focus image fusion is treated as a classification problem in our paper, the proposed fusion scheme can easily be extended to more than two input multi-focus images. Suppose there are three input source images A, B, and C: images A and B are first fused with our proposed HCNN method, and the output image is then fused with image C. In this way the proposed method can fuse multiple input images, as sketched below. To demonstrate this, we fused one sample triple of multi-focus images from the Lytro dataset. The three source input images and the fused image of our proposed HCNN method are shown in Fig. 11.
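The sequential scheme described above can be written as a simple left-to-right fold over the source images; hcnn_fuse is a placeholder name for any two-image fusion routine, not a function provided by the paper.

def fuse_many(images, fuse_pair):
    """Fuse an arbitrary number of registered multi-focus images.

    fuse_pair(a, b) is any two-image fusion routine (e.g. the HCNN fusion);
    the sources are folded left to right, exactly as described for A, B, C.
    """
    if not images:
        raise ValueError("need at least one image")
    fused = images[0]
    for nxt in images[1:]:
        fused = fuse_pair(fused, nxt)   # fuse the running result with the next source
    return fused

# Usage sketch (hcnn_fuse is a hypothetical two-image fusion function):
# result = fuse_many([img_a, img_b, img_c], hcnn_fuse)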

Fig. 8 The source multi-focus images of "Coca-Cola", the fused images, and the difference images between the fused images and the source images for the multi-focus image fusion methods. a The first source image, b the second source image, c MWGF, d, e the difference images of MWGF, f GFF, g, h the difference images of GFF, i BFMM, j, k the difference images of BFMM, l IMF, m, n the difference images of IMF, o SDMF, p, q the difference images of SDMF, r CBF, s, t the difference images of CBF, u CNN, v, w the difference images of CNN, x the proposed method of HCNN, y, z the difference images of the proposed method of HCNN


3.4 Discussion of results and future directions

There are two main limitations of our proposed method. First, the input images must be registered; this limitation applies to essentially all multi-focus image fusion papers. Second, the proposed network is trained through a supervised learning algorithm; it would be valuable for future researchers to introduce a framework that does not need labels for the multi-focus dataset. Also, because no comprehensive multi-focus dataset exists for deep learning based methods, future researchers could construct and collect such a comprehensive multi-focus image dataset.

In this paper, the simplest form of ensemble learning is used for the final prediction from the three branches of CNNs, and, to the best of our knowledge, this is the first use of ensemble learning in image fusion applications.

Table 4 Comparison of objective quality metrics of the proposed multi-focus image fusion method and the other methods. Methods: MWGF (Zhou et al. 2014), GFF (Li et al. 2013a), BFMM (Zhang et al. 2017), IMF (Li et al. 2013b), SDMF (Bavirisetti and Dhuli 2016), CBF (Kumar 2015), CNN (Liu et al. 2017), HCNN (proposed)

Test image  Metric       MWGF      GFF        BFMM      IMF       SDMF      CBF       CNN        HCNN
Balloon     API          113.7325  113.7191   113.7219  113.7048  114.2044  113.6658  113.7233   113.7231
            SD           48.1781   48.3562    48.3746   48.3363   54.4459   47.7738   48.3741    48.3750
            AG           9.6691    9.8115     9.8226    9.8180    17.1862   9.4899    9.8243     9.8253
            H            7.4590    7.4386     7.4336    7.4383    7.4314    7.4622    7.4337     7.4335
            MI           10.4531   11.1296    11.1809   11.0551   5.4343    10.3518   11.1798    11.1852
            FS           1.9949    1.9949     1.9968    1.9946    1.9980    1.9950    1.9966     1.9967
            CC           0.9903    0.9908     0.9908    0.9908    0.9741    0.9916    0.9908     0.9908
            SF           20.4795   20.7875    20.8192   20.7984   35.5096   19.9520   20.8223    20.8242
            QAB/F        0.9218    0.9274     0.9272    0.9254    0.7997    0.9387    0.9272     0.9271
            LAB/F        0.0740    0.0724     0.0728    0.0743    0.0224    0.0608    0.0728     0.0729
            NAB/F        0.0042    0.0001812  0.000002  0.000283  0.1779    0.000483  0.000416   0.000002
            QMI          1.4033    1.4962     1.5036    1.4862    0.7309    1.3894    1.5034     1.5042
            QTE          0.4196    0.4312     0.4321    0.4294    0.3181    0.4221    0.4321     0.4322
            QNICE        0.8656    0.8729     0.8735    0.8720    0.8306    0.8645    0.8735     0.8735
            QEP          2.1315    2.6990     2.7686    2.5293    0.1435    1.9022    2.7680     2.7713
            rSFe (QSF)   −0.0253   −0.0115    −0.0100   −0.0113   0.6686    −0.0516   −0.0098    −0.0097
            QY (SSIM)    0.9855    0.9935     0.9937    0.9909    0.7335    0.9875    0.9938     0.9938
            QCB          0.8646    0.9040     0.9057    0.8821    0.4706    0.8812    0.9059     0.9060
Coca-Cola   API          108.5607  108.5700   108.5657  108.5694  108.6970  108.5974  108.5714   108.5664
            SD           61.4838   61.4632    61.4990   61.5157   63.2648   61.2062   61.4887    61.5020
            AG           4.6663    4.6319     4.6719    4.7247    7.1961    4.5546    4.6555     4.6741
            H            7.6550    7.6545     7.6526    7.6559    7.6576    7.6583    7.6517     7.6525
            MI           11.5707   11.4172    11.6308   11.4505   8.2256    11.3704   11.8342    11.6204
            FS           1.9995    1.9771     1.9976    1.9981    1.9985    1.9990    1.9999     1.9968
            CC           0.9947    0.9947     0.9947    0.9945    0.9922    0.9950    0.9948     0.9947
            SF           12.0751   12.0323    12.0885   12.2490   17.8099   11.6628   12.0537    12.0948
            QAB/F        0.9160    0.9144     0.9161    0.9117    0.8538    0.9221    0.9162     0.9160
            LAB/F        0.0826    0.0815     0.0836    0.0837    0.0344    0.0769    0.0838     0.0832
            NAB/F        0.0014    0.0041     0.000314  0.0046    0.1118    0.000972  0.0000051  0.0007382
            QMI          1.4936    1.5137     1.5220    1.4978    0.9940    1.4123    1.5227     1.5226
            QTE          0.4210    0.4232     0.4236    0.4193    0.3595    0.4160    0.4237     0.4236
            QNICE        0.8782    0.8798     0.8805    0.8783    0.8446    0.8710    0.8806     0.8806
            QEP          2.4631    2.6259     2.7715    2.5609    0.2821    1.5274    2.7544     2.7620
            rSFe (QSF)   −0.0138   −0.0104    −0.0093   0.0016    0.4435    −0.0399   −0.0090    −0.0088
            QY (SSIM)    0.9779    0.9814     0.9858    0.9753    0.5827    0.9618    0.9856     0.9864
            QCB          0.7913    0.7942     0.8057    0.7859    0.4728    0.7413    0.8057     0.8041


The main ideas of this paper are devising a new strategy for feeding patches to the network and designing an ensemble-learning-based network for multi-focus image fusion that is trained on three types of datasets. These three types of datasets are created separately, so consulting them together yields better accuracy and a cleaner initial decision map. In addition, we introduced minor changes to the procedure for constructing the decision map. These contributions enable our proposed network to create a clean, nearly ideal initial segmented decision map using the simplest ensemble learning method, hard voting. Hard voting is a reliable choice for the ensemble part of the proposed network because it not only has the lowest complexity and a suitable speed among ensemble learning methods, but also already achieves a near-ideal initial decision map. Therefore, a more traditional and complex choice of ensemble learning method may not further increase the accuracy of the initial decision map and would be unproductive. Nevertheless, future research can focus on devising more elaborate ensemble-learning-based methods that could further influence our proposed network.
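As a concrete illustration of the hard-voting step, the sketch below takes the binary focus labels predicted by the three branches and keeps the majority label per patch; the 0/1 label convention is assumed for illustration, and this mirrors the hard-voting idea rather than reproducing the authors' exact implementation.

import numpy as np

def hard_vote(pred_a, pred_b, pred_c):
    """Majority (hard) vote over the binary focus labels of three CNN branches.

    Each argument is an array of 0/1 labels (assumed convention: 1 means the
    patch of the first source is in focus, 0 means the second source is).
    """
    votes = pred_a.astype(int) + pred_b.astype(int) + pred_c.astype(int)
    return (votes >= 2).astype(np.uint8)   # a label wins if at least 2 of 3 branches agree

# Example: the branches disagree on the second patch; the majority label is kept.
# hard_vote(np.array([1, 1, 0]), np.array([1, 0, 0]), np.array([1, 1, 1])) -> [1, 1, 0]

With three voters the decision reduces to a threshold on the vote sum, which is why hard voting adds almost no computational cost to the three-branch network.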

4 Conclusions

We proposed a new convolutional neural network based on ensemble learning for multi-focus image fusion. The network contains three branches of single CNNs, each trained on a different dataset, and the final prediction is obtained by hard voting over the predictions of the three branches. The three types of datasets were created using gradients in the horizontal and vertical directions, which provide useful information about focused and unfocused details for better prediction by the network. These datasets were fed to the network with the patch arrangement proposed in this paper, and this new arrangement helped significantly in constructing a cleaner initial decision map. The qualitative and quantitative assessments showed that the initial segmented decision map of the proposed network is more accurate and cleaner than the initial and final segmented decision maps of the previous state-of-the-art methods. In other words, the initial segmented decision map of our proposed method, without any post-processing, is cleaner and more accurate than the other decision maps with or without many post-processing algorithms. Extensive qualitative and quantitative experiments and analyses were performed on 28 pairs of well-known real multi-focus images with 19 quality fusion metrics, comparing the proposed method with 13 previous state-of-the-art methods. The qualitative experiments illustrated that the proposed method has the cleanest initial decision map, and the quantitative experiments demonstrated that the output fused image of the proposed method has the best quality among the compared methods. The source code of the proposed method and all supplementary files, such as the test multi-focus images and the fusion metrics, are provided on the personal website1 and GitHub2 of this paper's authors.

Fig. 9 The initial segmented decision maps, without any post-processing algorithms, of the proposed method and the other methods for the "Children" image. a The first source image, b the second source image, c MWGF, d IMF, e GFF, f BFMM, g SSDI, h DSIFT, i CNN, j FCN, k HCNN (the proposed method), l the fused image using the proposed method of HCNN

1 http://www.amin-naji.com and http://www.imagefusion.ir
2 http://www.github.com/mostafaaminnaji


Fig. 10 The Lytro multi-focus image dataset, the final decision maps of the proposed HCNN method, and the fused images of the proposed HCNN method. The symbols "A" and "B" stand for the source input multi-focus images; the symbols "M" and "F" stand for the final decision map and the fused image of the proposed HCNN method, respectively


References

Amin-Naji M, Aghagolzadeh A (2017a) Block DCT filtering using vector processing. arXiv preprint arXiv:1710.07193

Amin-Naji M, Aghagolzadeh A (2017b) Multi-focus image fusion using VOL and EOL in DCT domain. arXiv preprint arXiv:1710.06511

Amin-Naji M, Aghagolzadeh A (2018) Multi-focus image fusion in DCT domain using variance and energy of Laplacian and correlation coefficient for visual sensor networks. J Artif Intell Data Min 6:233–250. https://doi.org/10.22044/jadm.2017.5169.1624

Amin-Naji M, Ranjbar-Noiey P, Aghagolzadeh A (2017) Multi-focus image fusion using singular value decomposition in DCT domain. In: 10th Iranian conference on machine vision and image processing (MVIP). IEEE, New York, pp 45–51. https://doi.org/10.1109/IranianMVIP.2017.8342367

Table 5 MI, QAB/F, and VIF comparison of various image fusion methods on the Lytro dataset (a results from Guo et al. 2018). Values are averages over the 20 pairs of multi-focus images of the Lytro dataset shown in Fig. 10

Method                          MI      QAB/F   VIF
NSCTa (Zhang and Guo 2009)      3.1473  0.5709  0.5132
GFFa (Li et al. 2013a)          4.1211  0.7601  0.7430
IMFa (Li et al. 2013b)          4.2879  0.7534  0.7233
CBFa (Kumar 2015)               3.8211  0.7528  0.6870
DCHWTa (Kumar 2013)             3.3649  0.7124  0.6465
MWGFa (Zhou et al. 2014)        4.2336  0.7479  0.7316
BFMMa (Zhang et al. 2017)       4.4376  0.7572  0.7412
DSIFTa (Liu et al. 2015)        4.4588  0.7621  0.7492
CNNa (Liu et al. 2017)          4.3211  0.7618  0.7465
FCNa (Guo et al. 2018)          4.4578  0.7655  0.7531
HCNN (proposed)                 4.4634  0.7701  0.7572

Fig. 11 A sample triple of multi-focus images from the Lytro dataset and the fusion result of the proposed method. a The first source image A, b the second source image B, c the third source image C, d the fused image of A and B using the proposed HCNN method, e the final fused image of A, B, and C using the proposed HCNN method



Bavirisetti DP (2019) https://sites.google.com/view/durgaprasadbavirisetti/datasets. Accessed 6 Jan 2019
Bavirisetti DP, Dhuli R (2016) Multi-focus image fusion using multi-scale image decomposition and saliency detection. Ain Shams Eng J. https://doi.org/10.1016/j.asej.2016.06.011
Cao L, Jin L, Tao H, Li G, Zhuang Z, Zhang Y (2015) Multi-focus image fusion based on spatial frequency in discrete cosine transform domain. IEEE Signal Process Lett 22:220–224. https://doi.org/10.1109/LSP.2014.2354534
Chen Y, Blum RS (2009) A new automated quality assessment algorithm for image fusion. Image Vis Comput 27:1421–1432. https://doi.org/10.1016/j.imavis.2007.12.002
CNN from Wikipedia, the free encyclopedia (2019) http://cs231n.github.io/convolutional-networks/. Accessed 6 Jan 2019
Cvejic N, Canagarajah C, Bull D (2006) Image fusion metric based on mutual information and Tsallis entropy. Electron Lett 42:626–627. https://doi.org/10.1049/el:20060693
Deep Learning from Wikipedia, the free encyclopedia (2019) https://en.wikipedia.org/wiki/Deep_learning. Accessed 6 Jan 2019
Dietterich TG (2000) Ensemble methods in machine learning. In: International workshop on multiple classifier systems. Springer, New York, pp 1–15. https://doi.org/10.1007/3-540-45014-9_1
Dogra A, Goyal B, Agrawal S (2017) From multi-scale decomposition to non-multi-scale decomposition methods: a comprehensive survey of image fusion techniques and its applications. IEEE Access 5:16040–16067. https://doi.org/10.1109/ACCESS.2017.2735865
Du C, Gao S (2017) Image segmentation-based multi-focus image fusion through multi-scale convolutional neural network. IEEE Access 5:15750–15761. https://doi.org/10.1109/ACCESS.2017.2735019
Du C-b, Gao S-s (2018) Multi-focus image fusion with the all convolutional neural network. Optoelectron Lett 14:71–75. https://doi.org/10.1007/s11801-018-7207-x
Ghassemian H (2016) A review of remote sensing image fusion methods. Inf Fusion 32:75–89. https://doi.org/10.1016/j.inffus.2016.03.003
Goodfellow I, Bengio Y, Courville A, Bengio Y (2016) Deep learning. MIT Press, Cambridge
Guo D (2019) http://en.pudn.com/Download/item/id/2678955.html. Accessed 6 Jan 2019
Guo D, Yan J, Qu X (2015) High quality multi-focus image fusion using self-similarity and depth information. Opt Commun 338:138–144. https://doi.org/10.1016/j.optcom.2014.10.031
Guo X, Nie R, Cao J, Zhou D, Qian W (2018) Fully convolutional network-based multifocus image fusion. Neural Comput 30:1775–1800. https://doi.org/10.1162/neco_a_01098
Haghighat MBA, Aghagolzadeh A, Seyedarabi H (2011) Multi-focus image fusion for visual sensor networks in DCT domain. Comput Electr Eng 37:789–797. https://doi.org/10.1016/j.compeleceng.2011.04.016
Hossny M, Nahavandi S, Creighton D (2008) Comments on 'Information measure for performance of image fusion'. Electron Lett 44:1066–1067. https://doi.org/10.1049/el:20081754
Huang W, Jing Z (2007) Evaluation of focus measures in multi-focus image fusion. Pattern Recognit Lett 28:493–500. https://doi.org/10.1016/j.patrec.2006.09.005
James AP, Dasarathy BV (2014) Medical image fusion: a survey of the state of the art. Inf Fusion 19:4–19. https://doi.org/10.1016/j.inffus.2013.12.002
Kang X (2019) http://xudongkang.weebly.com/. Accessed 6 Jan 2019
Krähenbühl P, Koltun V (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Advances in neural information processing systems, pp 109–117
Kumar BS (2013) Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform. Signal Image Video Process 7:1125–1143. https://doi.org/10.1007/s11760-012-0361-x
Kumar BS (2015) Image fusion based on pixel significance using cross bilateral filter. Signal Image Video Process 9:1193–1204. https://doi.org/10.1007/s11760-013-0556-9
Kumar BKS (2019) https://mathworks.com/matlabcentral/fileexchange/43781-image-fusion-based-on-pixel-significance-using-cross-bilateral-filter. Accessed 6 Jan 2019
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86:2278–2324. https://doi.org/10.1109/5.726791
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436. https://doi.org/10.1038/nature14539
Li H, Manjunath B, Mitra SK (1995) Multisensor image fusion using the wavelet transform. Graph Models Image Process 57:235–245. https://doi.org/10.1006/gmip.1995.1022
Li M, Cai W, Tan Z (2006) A region-based multi-sensor image fusion scheme using pulse-coupled neural network. Pattern Recognit Lett 27:1948–1956. https://doi.org/10.1016/j.patrec.2006.05.004
Li S, Kang X, Hu J (2013a) Image fusion with guided filtering. IEEE Trans Image Process 22:2864–2875. https://doi.org/10.1109/TIP.2013.2244222
Li S, Kang X, Hu J, Yang B (2013b) Image matting for fusion of multi-focus images in dynamic scenes. Inf Fusion 14:147–162. https://doi.org/10.1016/j.inffus.2011.07.001
Li S, Kang X, Fang L, Hu J, Yin H (2017) Pixel-level image fusion: a survey of the state of the art. Inf Fusion 33:100–112. https://doi.org/10.1016/j.inffus.2016.05.004
Liang J, He Y, Liu D, Zeng X (2012) Image fusion using higher order singular value decomposition. IEEE Trans Image Process 21:2898–2909. https://doi.org/10.1109/TIP.2012.2183140
Liu Y (2019b) http://www.escience.cn/people/liuyu1/Codes.html. Accessed 6 Jan 2019
Liu Z (2019a) https://github.com/zhengliu6699/imageFusionMetrics. Accessed 6 Jan 2019
Liu Y, Liu S, Wang Z (2015) Multi-focus image fusion with dense SIFT. Inf Fusion 23:139–155. https://doi.org/10.1016/j.inffus.2014.05.004
Liu Y, Chen X, Peng H, Wang Z (2017) Multi-focus image fusion with a deep convolutional neural network. Inf Fusion 36:191–207. https://doi.org/10.1016/j.inffus.2016.12.001
Liu Y, Chen X, Wang Z, Wang ZJ, Ward RK, Wang X (2018a) Deep learning for pixel-level image fusion: recent advances and future prospects. Inf Fusion 42:158–173. https://doi.org/10.1016/j.inffus.2017.10.007
Liu C, Long Y, Mao J, Zhang H, Huang R, Dai Y (2018b) An effective image fusion algorithm based on grey relation of similarity and morphology. J Ambient Intell Hum Comput. https://doi.org/10.1007/s12652-018-0873-5
Ma J, Ma Y, Li C (2019) Infrared and visible image fusion methods and applications: a survey. Inf Fusion 45:153–178. https://doi.org/10.1016/j.inffus.2018.02.004
Maji D, Santara A, Mitra P, Sheet D (2016) Ensemble of deep convolutional neural networks for learning to detect retinal vessels in fundus images. arXiv preprint arXiv:1603.04833
Microsoft COCO Dataset (2019) http://cocodataset.org/. Accessed 6 Jan 2019
Naji MA, Aghagolzadeh A (2015a) A new multi-focus image fusion technique based on variance in DCT domain. In: 2nd international conference on knowledge-based engineering and innovation (KBEI). IEEE, New York, pp 478–484. https://doi.org/10.1109/KBEI.2015.7436092
Naji MA, Aghagolzadeh A (2015b) Multi-focus image fusion in DCT domain based on correlation coefficient. In: 2nd international conference on knowledge-based engineering and innovation (KBEI). IEEE, New York, pp 632–639. https://doi.org/10.1109/KBEI.2015.7436118



Nejati M, Lytro Multi-focus Dataset (2019) https://mansournejati.ece.iut.ac.ir/content/lytro-multi-focus-dataset. Accessed 6 Jan 2019
Nejati M, Samavi S, Shirani S (2015) Multi-focus image fusion using dictionary-based sparse representation. Inf Fusion 25:72–84. https://doi.org/10.1016/j.inffus.2014.10.004
Nejati M, Samavi S, Karimi N, Soroushmehr SR, Shirani S, Roosta I, Najarian K (2017) Surface area-based focus criterion for multi-focus image fusion. Inf Fusion 36:284–295. https://doi.org/10.1016/j.inffus.2016.12.009
Opitz D, Maclin R (1999) Popular ensemble methods: an empirical study. J Artif Intell Res 11:169–198. https://doi.org/10.1613/jair.614
Pertuz S, Puig D, Garcia MA (2013) Analysis of focus measure operators for shape-from-focus. Pattern Recognit 46:1415–1432. https://doi.org/10.1016/j.patcog.2012.11.011
Petrovic VS, Xydeas CS (2004) Gradient-based multiresolution image fusion. IEEE Trans Image Process 13:228–237. https://doi.org/10.1109/TIP.2004.823821
Petrovic V, Xydeas C (2005) Objective image fusion performance characterisation. In: Tenth IEEE international conference on computer vision, ICCV 2005. IEEE, New York, pp 1866–1871. https://doi.org/10.1109/ICCV.2005.175
Phamila YAV, Amutha R (2014) Discrete cosine transform based fusion of multi-focus images for visual sensor networks. Signal Process 95:161–170. https://doi.org/10.1016/j.sigpro.2013.09.001
Sheikh H, Bovik A (2006) Image information and visual quality. IEEE Trans Image Process 15:430–444. https://doi.org/10.1109/TIP.2005.859378
Smith MI, Heather JP (2005) A review of image fusion technology in 2005. In: Thermosense XXVII. International Society for Optics and Photonics, pp 29–46. https://doi.org/10.1117/12.597618
Stathaki T (2011) Image fusion: algorithms and applications. Elsevier, London
Tang H, Xiao B, Li W, Wang G (2018) Pixel convolutional neural network for multi-focus image fusion. Inf Sci 433:125–141. https://doi.org/10.1016/j.ins.2017.12.043
Wang Y (2019) https://github.com/budaoxiaowanzi/image-fusion. Accessed 6 Jan 2019
Wang P-W, Liu B (2008) A novel image fusion metric based on multi-scale analysis. In: 9th international conference on signal processing, ICSP 2008. IEEE, New York, pp 965–968. https://doi.org/10.1109/ICOSP.2008.4697288
Wang Q, Shen Y, Jin J (2008) Performance evaluation of image fusion techniques. Image Fusion Algorithms Appl 19:469–492
Wu W, Yang X, Pang Y, Peng J, Jeon G (2013) A multifocus image fusion method by using hidden Markov model. Opt Commun 287:63–72. https://doi.org/10.1016/j.optcom.2012.08.101
Xu K, Qin Z, Wang G, Zhang H, Huang K, Ye S (2018) Multi-focus image fusion using fully convolutional two-stream network for visual sensors. KSII Trans Internet Inf Syst. https://doi.org/10.3837/tiis.2018.05.019
Xydeas CA, Petrovic V (2000) Objective image fusion performance measure. Electron Lett 36:308–309. https://doi.org/10.1049/el:20000267
Yang C, Zhang J-Q, Wang X-R, Liu X (2008) A novel similarity based quality metric for image fusion. Inf Fusion 9:156–160. https://doi.org/10.1016/j.inffus.2006.09.001
Yang Y, Yang M, Huang S, Que Y, Ding M, Sun J (2017) Multifocus image fusion based on extreme learning machine and human visual system. IEEE Access 5:6989–7000. https://doi.org/10.1109/ACCESS.2017.2696119
Zhang Y (2019) https://github.com/uzeful/Boundary-Finding-based-Multi-focus-Image-Fusion. Accessed 6 Jan 2019
Zhang Q, Guo B-l (2009) Multifocus image fusion using the nonsubsampled contourlet transform. Signal Process 89:1334–1346. https://doi.org/10.1016/j.sigpro.2009.01.012
Zhang Y, Bai X, Wang T (2017) Boundary finding based multi-focus image fusion through multi-scale morphological focus-measure. Inf Fusion 35:81–101. https://doi.org/10.1016/j.inffus.2016.09.006
Zheng Y, Essock EA, Hansen BC, Haun AM (2007) A new metric based on extended spatial frequency and its application to DWT based fusion algorithms. Inf Fusion 8:177–192. https://doi.org/10.1016/j.inffus.2005.04.003
Zhiqiang Z (2019) http://www.pudn.com/Download/item/id/3068967.html. Accessed 6 Jan 2019
Zhou Z-H, Wu J, Tang W (2002) Ensembling neural networks: many could be better than all. Artif Intell 137:239–263. https://doi.org/10.1016/S0004-3702(02)00190-X
Zhou Z, Li S, Wang B (2014) Multi-scale weighted gradient-based fusion for multi-focus images. Inf Fusion 20:60–72. https://doi.org/10.1016/j.inffus.2013.11.005

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.