

Remote Sensing Image Scene Classification Meets Deep Learning: Challenges, Methods, Benchmarks, and Opportunities

Gong Cheng, Xingxing Xie, Junwei Han, Senior Member, IEEE, Lei Guo, Gui-Song Xia, Senior Member, IEEE

Abstract—Remote sensing image scene classification, which aims at labeling remote sensing images with a set of semantic categories based on their contents, has broad applications in a range of fields. Propelled by the powerful feature learning capabilities of deep neural networks, remote sensing image scene classification driven by deep learning has drawn remarkable attention and achieved significant breakthroughs. However, to the best of our knowledge, a comprehensive review of recent achievements regarding deep learning for scene classification of remote sensing images is still lacking. Considering the rapid evolution of this field, this paper provides a systematic survey of deep learning methods for remote sensing image scene classification, covering more than 160 papers. To be specific, we discuss the main challenges of remote sensing image scene classification and survey (1) autoencoder-based, (2) convolutional neural network-based, and (3) generative adversarial network-based remote sensing image scene classification methods. In addition, we introduce the benchmarks used for remote sensing image scene classification and summarize the performance of more than two dozen representative algorithms on three commonly-used benchmark data sets. Finally, we discuss promising opportunities for further research.

Index Terms—Deep learning, remote sensing image, scene classification.

I. INTRODUCTION

REMOTE sensing images, a valuable data source for earth observation, can help us measure and observe detailed structures on the Earth's surface. Thanks to the advances of earth observation technology [1], [2], the volume of remote sensing images is growing drastically. This has given particular urgency to the question of how to make full use of ever-increasing remote sensing images for intelligent earth observation [3], [4]. Hence, it is extremely important to understand huge and complex remote sensing images. As a key and challenging problem for effectively interpreting remote sensing imagery, scene classification of remote sensing images has been an active research area.

This work was supported in part by the Science, Technology and Innovation Commission of Shenzhen Municipality under Grant JCYJ20180306171131643, in part by the National Science Foundation of China under Grants 61772425 and 61773315, and in part by the Fundamental Research Funds for the Central Universities under Grant 3102019AX09 (J. Han is the corresponding author).

G. Cheng and X. Xie are with the Research & Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen 518057, China, and also with the School of Automation, Northwestern Polytechnical University, Xi'an 710129, China.

J. Han and L. Guo are with the School of Automation, Northwestern Polytechnical University, Xi'an 710129, China.

G.-S. Xia is with the School of Computer Science, Wuhan University, Wuhan 430072, China.

Manuscript received xxx xx, xxx.

Fig. 1: Illustration of remote sensing image scene classification, which aims at labeling each remote sensing image patch with a semantic class based on its content.

Remote sensing image scene classification is to correctly label given remote sensing images with predefined semantic categories, as shown in Fig. 1. Over the last few decades, extensive research on remote sensing image scene classification has been undertaken, driven by its real-world applications such as urban planning [5], [6], natural hazards detection [7]–[9], environment monitoring [10]–[12], vegetation mapping [13], [14], and geospatial object detection [15]–[22].

With the improvement of the spatial resolution of remote sensing images, remote sensing image classification has gradually formed three parallel classification branches at different levels: pixel-level, object-level, and scene-level classification, as shown in Fig. 2 and Fig. 3. Here, it is worth mentioning that we use the term “remote sensing image classification” as a general concept, which includes pixel-level, object-level, and scene-level classification of remote sensing images. To be specific, in the early literature, researchers mainly focused on classifying remote sensing images at the pixel or subpixel level [23]–[25], labeling each pixel in the remote sensing images with a semantic class, because the spatial resolution of remote sensing images was very low: the size of a pixel was similar to the sizes of the objects of interest [26]. To date, pixel-level remote sensing image classification (sometimes also called semantic segmentation, as shown in Fig. 3 (a)) is still an active research topic in the areas of multispectral and hyperspectral remote sensing image analysis [27]–[31].



Fig. 2: A road map of remote sensing image classification. With the improvement of the spatial resolution of remote sensing images, remote sensing image classification gradually formed three parallel classification branches at different levels: pixel-level, object-level, and scene-level classification. Here, it is worth mentioning that we use “remote sensing image classification” as a general concept.

Fig. 3: Three levels of remote sensing image classification: (a) pixel-level remote sensing image classification focuses on labeling each pixel with a class; (b) object-level remote sensing image classification aims at recognizing objects in remote sensing images; (c) scene-level remote sensing image classification seeks to classify each given remote sensing image patch into a semantic class. This survey focuses on scene-level remote sensing image classification.

Due to the advance of remote sensing imaging, the spatial resolution of remote sensing images is increasingly finer than common objects of interest, such that single pixels lose their semantic meanings. In such cases, it is not feasible to recognize scene images at the pixel level alone, and per-pixel analysis began to be viewed with increasing dissatisfaction. In 2001, Blaschke and Strobl [32] questioned the dominance of the per-pixel research paradigm and concluded that analyzing remote sensing images at the object level is more efficient than per-pixel analysis. They suggested that researchers pay attention to object-level analysis, which aims at recognizing objects in remote sensing images, as shown in Fig. 3 (b), where the term “object” refers to meaningful semantic entities or scene units. Subsequently, a series of approaches that analyze remote sensing images at the object level has dominated remote sensing image analysis for the last two decades [33]–[36]. Amazing achievements in certain specific land use identification tasks have been accomplished by pixel-level and object-level classification algorithms.

However, remote sensing images may contain many different and distinct object classes because of their increasing resolution, and pixel-level and object-level methods may not be sufficient to always classify them correctly. Under these circumstances, it is of considerable interest to understand the global contents and meanings of remote sensing images, and a new paradigm of scene-level analysis of remote sensing images has recently been suggested. Scene-level remote sensing image classification, namely remote sensing image scene classification, seeks to classify each given remote sensing image patch (e.g., 256×256 pixels) into a semantic class, as illustrated in Fig. 3 (c). Here the term “scene” represents an image patch cropped from a large-scale remote sensing image that contains clear semantic information on the earth surface [37], [38].

It is a significant step to be able to represent visual data with discriminative features in almost all tasks of computer vision, and the remote sensing domain is no exception. During the previous decade, extensive efforts were devoted to developing discriminative visual features. A majority of early remote sensing image scene classification methods relied on human-engineered descriptors, e.g., Scale-Invariant Feature Transformation (SIFT) [39], Texture Descriptors (TD) [40]–[42], Color Histogram (CH) [43], Histogram of Oriented Gradients (HOG) [44], and GIST [45]. Owing to their ability to represent an entire image with features, CH, GIST, and TD can be applied directly to remote sensing image scene classification. However, SIFT and HOG cannot represent an entire image directly because of their local nature. To make handcrafted local descriptors represent an entire scene image, these local descriptors are encoded by certain encoding methods (e.g., the Improved Fisher Kernel (IFK) [46], Vector of Locally Aggregated Descriptors (VLAD) [47], Spatial Pyramid Matching (SPM) [48], and the popular Bag-of-Visual-Words (BoVW) [49]). Thanks to the simplicity and efficiency of these feature encoding methods, they have been broadly applied in the field of remote sensing image scene classification [50]–[55], although the representation capability of handcrafted features remains limited.

In this case, unsupervised learning methods, such as k-means clustering, Principal Component Analysis (PCA) [56], and sparse coding [57], which automatically learn features from unlabeled images, became an appealing alternative to human-engineered features. A considerable number of unsupervised learning-based scene classification methods have emerged [58]–[66] and made substantial progress in scene classification. Nevertheless, these unsupervised learning approaches cannot make full use of data class information.

Fortunately, due to the advances in deep learning theory and the increased availability of remote sensing data and parallel computing resources, deep learning-based algorithms have increasingly prevailed in the area of remote sensing image scene classification. In 2006, Hinton and Salakhutdinov [67] created an approach to initialize the weights for training multilayer neural networks, which laid a solid foundation for the later development of deep learning. From 2006 to 2012, simple deep learning models were developed (e.g., deep belief nets [68], the autoencoder [67], and the stacked autoencoder [69]).

The feature description capabilities of these simple deep learning models have been demonstrated in many fields, including remote sensing image scene classification. Since AlexNet, a deep Convolutional Neural Network (CNN) designed by Krizhevsky et al. [70] in 2012, obtained the best results in the Large-Scale Visual Recognition Challenge (LSVRC) [71], a great many advanced deep CNNs have come forth and broken records in many fields. In the wake of these successes, CNN-based methods have emerged in remote sensing image scene classification [72]–[74] and achieved advanced classification accuracy.

Nevertheless, CNN-based methods generally demand massive annotated training data, which greatly limits their application scenarios. More recently, Generative Adversarial Networks (GANs) [82], a promising unsupervised learning method, have achieved significant success in many applications. To remedy the above-mentioned limitations, GANs have been employed by some researchers in the field of remote sensing image scene classification [83], [84].

Fig. 4: The number of publications in remote sensing image scene classification from 2012 to 2019. Data from Google Scholar advanced search: allintitle: (“remote sensing” or “aerial” or “satellite” or “land use”) and “scene classification”.

Currently, driven by deep learning, a great number of remote sensing image scene classification methods have sprung up (see Fig. 4). The number of papers on remote sensing image scene classification increased dramatically after 2014 and again after 2017, for two reasons. On the one hand, around 2014, deep learning techniques began to be applied to remote sensing data analysis. On the other hand, in 2017, large-scale remote sensing image scene classification benchmarks appeared, which have greatly facilitated the development of deep learning-based remote sensing image scene classification.

In the past several years, numerous reviews of remote sensing image classification methods have been published, which are summarized in Table I. For example, Tuia et al. [25] surveyed, tested, and compared three families of active learning-based remote sensing image classification methods: committee, large margin, and posterior probability. Gómez-Chova et al. [2] surveyed multimodal remote sensing image classification and summarized the leading algorithms in this field. In [78], Maulik et al. conducted a review of remote sensing image classification algorithms based on the support vector machine (SVM). Li et al. [75] surveyed the pixel-level, subpixel-level, and object-based methods of image classification and emphasized the contribution of spatio-contextual information to remote sensing image classification.

As alternative ways to extract robust, abstract, and high-level features from images, deep learning models have made amazing progress on a broad range of tasks in processing images, video, speech, and audio. Subsequently, a number of deep learning-based scene classification algorithms were proposed, such as CNN-based methods and GAN-based methods, and a number of reviews of scene classification approaches have been published. Penatti et al. [85] assessed the generalization ability of pre-trained CNNs in the classification of remote sensing images. In [38], Hu et al. surveyed how to apply CNNs trained on the ImageNet data set to remote sensing image scene classification. Zhu et al. [77] presented a tutorial about deep learning-based remote sensing data analysis.


TABLE I: Summarization of a number of surveys of remote sensing image analysis.

1. “A survey of active learning algorithms for supervised remote sensing image classification” [25] (IEEE JSTSP, 2011): surveying and testing the main families of active learning methods.
2. “A review of remote sensing image classification techniques: the role of spatio-contextual information” [75] (EuJRS, 2014): review of pixel-wise, subpixel-wise and object-based methods for remote sensing image classification, exploring the contribution of spatio-contextual information to scene classification.
3. “Multimodal classification of remote sensing images: a review and future directions” [2] (Proceedings of the IEEE, 2015): offering a taxonomical view of the field of multimodal remote sensing image classification.
4. “Deep learning for remote sensing data: a technical tutorial on the state of the art” [76] (IEEE GRSM, 2016): reviewing deep learning-based remote sensing data analysis techniques before 2016.
5. “Deep learning in remote sensing: a comprehensive review and list of resources” [77] (IEEE GRSM, 2017): reviewing the progress of deep learning-based remote sensing data analysis before 2017.
6. “Advanced spectral classifiers for hyperspectral images: a review” [27] (IEEE GRSM, 2017): review and comparison of different supervised hyperspectral classification methods.
7. “Remote sensing image classification: a survey of support-vector-machine-based advanced techniques” [78] (IEEE GRSM, 2017): review of remote sensing image classification based on SVM.
8. “AID: a benchmark data set for performance evaluation of remote sensing image scene classification” [79] (IEEE TGRS, 2017): review of aerial image scene classification methods before 2017 and proposal of the AID data set.
9. “Remote sensing image scene classification: benchmark and state of the art” [80] (Proceedings of the IEEE, 2017): reviewing the progress of scene classification of remote sensing images before 2017 and proposal of the NWPU-RESISC45 data set.
10. “Recent advances on spectral–spatial hyperspectral image classification: an overview and new guidelines” [28] (IEEE TGRS, 2017): survey of the progress in the classification of spectral–spatial hyperspectral images.
11. “Deep learning for hyperspectral image classification: an overview” [29] (IEEE TGRS, 2019): review of hyperspectral image classification based on deep learning.
12. “Deep learning in remote sensing applications: a meta-analysis and review” [81] (ISPRS JPRS, 2019): providing a review of the applications of deep learning in remote sensing image analysis.
13. “Remote sensing image scene classification meets deep learning: challenges, methods, benchmarks, and opportunities” (IEEE JSTARS, 2020): a systematic review of recent advances in remote sensing image scene classification driven by deep learning (this survey).

In order to make full use of pre-trained CNNs, Nogueira et al. [86] analyzed the performance of CNNs for remote sensing image scene classification under different learning strategies: full training, fine-tuning, and using CNNs as feature extractors. In [76], Zhang et al. reviewed recent deep learning-based remote sensing data analysis. Considering the limited number of scene categories and the accuracy saturation of the existing scene classification data sets, Cheng et al. [80] released a large-scale scene classification benchmark, named NWPU-RESISC45, and provided a survey of advances in remote sensing image scene classification before 2017. In [79], Xia et al. proposed a novel benchmark, called AID, for aerial image classification and reviewed the existing scene classification methods before 2017. Ma et al. [81] provided a review of the applications of deep learning in remote sensing image analysis. In addition, there have been several hyperspectral image classification surveys [27]–[29].

However, a thorough survey of deep learning for scene classification is still lacking. This motivates us to deeply analyze the main challenges faced by remote sensing image scene classification, systematically review deep learning-based scene classification approaches, most of which were published during the last five years, introduce the mainstream scene classification benchmarks, and discuss several promising future directions of scene classification.

The remainder of this paper is organized as follows. Section II discusses the current main challenges of remote sensing image scene classification. A brief review of deep learning models and a comprehensive survey of deep learning-based scene classification methods are provided in Section III. The scene classification data sets are introduced in Section IV. In Section V, the performance of deep learning-based scene classification methods is compared and discussed on three widely used scene classification benchmarks. In Section VI, we discuss promising future directions of scene classification. Finally, we conclude this paper in Section VII.

II. MAIN CHALLENGES OF REMOTE SENSING IMAGE SCENE CLASSIFICATION

The ideal goal of scene classification of remote sensing images is to correctly label given remote sensing images with their corresponding semantic classes according to their contents, for example, categorizing an urban remote sensing image into a residential, commercial, or industrial area. Generally speaking, a remote sensing image contains a variety of ground objects; for instance, roads, trees, and buildings may all be included in an industrial scene. Different from object-oriented classification, scene classification is a considerably challenging problem because of the variance and complex spatial distributions of the ground objects existing in the scenes. Historically, extensive studies of remote sensing image scene classification have been made. However, no algorithm yet achieves the goal of classifying remote sensing image scenes with satisfactory accuracy. The challenges of remote sensing image scene classification include (1) big within-class diversity, (2) high between-class similarity (also known as low between-class separability), (3) large variance of object/scene scales, and (4) coexistence of multiple ground objects, as shown in Fig. 5.

In terms of within-class diversity, the challenge mainly stems from the large variations in the appearance of ground objects within the same semantic class. Ground objects commonly vary in style, shape, scale, and distribution, which makes it difficult to correctly classify scene images. For example, in Fig. 5 (a), the churches appear in different building styles, and the airports and railway stations appear in different shapes. In addition, when airborne or space platforms capture remote sensing images, there may be large differences in color and radiation intensity within the same semantic class on account of the imaging conditions, which can be influenced by factors such as weather, cloud, and mist. Variations in scene illumination may also cause within-class diversity; for example, the appearance of the scenes labeled “beach” shows large differences under different imaging conditions, as shown in Fig. 5 (a).

For between-class similarity, the challenge is chiefly caused by the presence of the same objects in different scene classes or high semantic overlap between scene categories. For instance, in Fig. 5 (b), the scene classes of bridge and overpass both contain the same ground object, namely a bridge, and basketball courts and tennis courts share high semantic information. Moreover, the ambiguous definition of scene classes degrades inter-class dissimilarity, and some complex scenes are also similar to each other in terms of their visual contents. Therefore, it may be extremely difficult to distinguish these scene classes.

The large variance of object/scene scales is also a non-negligible challenge for remote sensing image scene classification. In remote sensing imaging, sensors operate in orbits at various altitudes, from a few hundred kilometers to more than ten thousand kilometers, which leads to imaging altitude variation. As the examples in Fig. 5 (c) illustrate, the scenes of airplane, storage tank, and thermal power station exhibit huge scale differences under different imaging altitudes. In addition, because of some intrinsic factors, variations in size for each object/scene category can also exist; for example, the rivers shown in Fig. 5 (c) are presented in several different sub-scenes: stream, brook, and creek.

Moreover, owing to the complex and diverse distribution of ground objects and the wide bird's-eye perspective of remote sensing imaging equipment, it is quite common that multiple ground objects appear in a single remote sensing image. As illustrated in Fig. 5 (d), the scenes of commercial areas may contain buildings, cars, rivers, roads, parking lots, meadows, swimming pools, and playgrounds; roads, trees, bridges, rivers, and cars can coexist in the scenes of industrial areas; the scenes of ground track fields may be accompanied by swimming pools, cars, roads, meadows, and trees; and the scenes of freeways contain meadows, trees, buildings, cars, rivers, bridges, forests, parking lots, etc. Faced with this situation, it is difficult for single-label remote sensing image scene classification to provide a deep understanding of the contents of remote sensing images.

III. SURVEY ON DEEP LEARNING-BASED REMOTE SENSING IMAGE SCENE CLASSIFICATION METHODS

In the past decades, many researchers have committed to scene classification of remote sensing images, driven by its wide applications. A number of advanced scene classification systems or approaches have been proposed, especially driven by deep learning. Before deep learning came to the attention of this field, scene classification methods mainly relied on handcrafted features (e.g., Color Histogram (CH), Texture Descriptors (TD), and GIST) or on representations generated by encoding local features via BoVW, IFK, SPM, etc. Later, considering that handcrafted features only extract low-level information, many researchers turned to unsupervised learning methods (e.g., sparse coding, PCA, and k-means). By automatically learning discriminative features from unlabeled data, unsupervised learning-based methods have obtained good results in the scene classification of remote sensing images. Yet, unsupervised learning-based algorithms do not adequately exploit data class information, which limits their ability to discriminate between different scene classes. Now, thanks to the availability of enormous labeled data, the advances in machine learning theory, and the increased availability of computational resources, deep learning models (e.g., autoencoders, CNNs, and GANs) have shown powerful abilities to learn fruitful features and have permeated many research fields, including the area of remote sensing image scene classification. Currently, numerous deep learning-based scene classification algorithms have emerged and have yielded the best classification accuracy. In this section, we systematically survey about 50 deep learning-based algorithms for scene classification of remote sensing images. In Fig. 6, we present some milestone works. That is one small step for deep learning theory, but one giant leap for the scene classification of remote sensing images [87]. From autoencoders, to CNNs, and then to GANs, deep learning algorithms constantly update scene classification records. To sum up, most deep learning-based scene classification algorithms can be broadly divided into three main categories: autoencoder-based methods, CNN-based methods, and GAN-based methods. In what follows, we discuss the three categories of methods at length.

A. Autoencoder-Based Remote Sensing Image Scene Classification

Fig. 5: Challenges of remote sensing image scene classification, which include (a) big within-class diversity, (b) high between-class similarity (also known as low between-class separability), (c) large variance of object/scene scales, and (d) coexistence of multiple ground objects. These images are from the NWPU-RESISC45 data set [80].

Fig. 6: Milestones of deep learning-based remote sensing image scene classification, including different deep learning-based methods and data sets. The red line represents typical data sets. The green, blue, and orange lines stand for autoencoder-based, CNN-based, and GAN-based remote sensing image scene classification, respectively.

Fig. 7: The architectures of (a) autoencoder and (b) stacked autoencoder. The red, yellow, and green nodes stand for the hidden layers of autoencoders AE1, AE2, and AE3, respectively. When stacking these autoencoders, the output of the hidden layer of the previous autoencoder is the input of the following autoencoder. For example, the output of the hidden layer of AE1 is the input of AE2, and the output of the hidden layer of AE2 is the input of AE3.

1) Brief introduction of autoencoder: The autoencoder [67] is an unsupervised feature learning model, which consists of a shallow and symmetrical neural network (see Fig. 7 (a)). An autoencoder consists of three layers: an input layer, a hidden layer, and an output layer. It contains two units, an encoder and a decoder. The transformation from the input layer to the hidden layer is the process of encoding, which can be formulated as equation (1), where h ∈ R^n is the output of the hidden layer, f denotes a nonlinear mapping, W ∈ R^{n×m} stands for the encoding weight matrix, x ∈ R^m denotes the input of the autoencoder, and b ∈ R^n is the bias vector. Decoding is the inverse of encoding, i.e., the transformation from the hidden layer to the output layer, which can be formulated as equation (2), where x̂ ∈ R^m represents the reconstructed output, W′ ∈ R^{m×n} denotes the decoding weight matrix, and b′ ∈ R^m stands for the bias vector.

h = f(W · x + b) (1)

x̂ = f(W′ · h + b′) (2)
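To make equations (1) and (2) concrete, the following is a minimal autoencoder sketch in Python/PyTorch. It is an illustration only, not the implementation of any surveyed method; the layer sizes, the sigmoid activation, and the mean-squared-error reconstruction loss are assumptions.

```python
# Minimal autoencoder sketch (PyTorch) implementing equations (1)-(2).
# Illustrative only: layer sizes, activation, and loss are assumptions.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, m=256, n=64):
        super().__init__()
        self.encoder = nn.Linear(m, n)   # W, b   (equation (1))
        self.decoder = nn.Linear(n, m)   # W', b' (equation (2))
        self.f = nn.Sigmoid()            # nonlinear mapping f

    def forward(self, x):
        h = self.f(self.encoder(x))      # h = f(W · x + b)
        x_hat = self.f(self.decoder(h))  # x_hat = f(W' · h + b')
        return h, x_hat

model = Autoencoder()
x = torch.rand(32, 256)                  # a batch of 32 input vectors
h, x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # reconstruction error term
```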

The autoencoder is able to compress high-dimensional features by minimizing a cost function that usually consists of a reconstruction error term and a regularization term. Using gradient descent with back propagation, the autoencoder can learn the network parameters. In real applications, multilayer stacked autoencoders are used for feature learning (see Fig. 7 (b)). For example, three individual autoencoders AE1, AE2, and AE3 are stacked together to form a stacked autoencoder, as shown in Fig. 7 (b). When stacking these autoencoders, the output of the hidden layer of the previous autoencoder is the input of the following autoencoder; for example, the output of the hidden layer of AE1 is the input of AE2, and the output of the hidden layer of AE2 is the input of AE3. The key to training stacked autoencoders is how to initialize the network: the way the network parameters are initialized influences network convergence, especially in the early layers, as well as the stability of training. Fortunately, Hinton et al. [67] provided a good solution for initializing the network weights by using restricted Boltzmann machines.
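A common way to realize this stacking is greedy layer-wise pretraining: each autoencoder is trained to reconstruct the hidden activations of the previous one, and the trained encoders are then stacked. The sketch below reuses the hypothetical Autoencoder class above; the layer sizes, epoch count, and Adam optimizer are assumptions (the original approach of Hinton et al. [67] initialized the weights with restricted Boltzmann machines instead).

```python
# Greedy layer-wise pretraining sketch for a stacked autoencoder (AE1-AE3).
# Reuses the Autoencoder class above; sizes and training loop are illustrative.
import torch

sizes = [256, 128, 64, 32]                # input dim and three hidden dims
stages, data = [], torch.rand(1024, 256)  # stand-in for unlabeled features

for m, n in zip(sizes[:-1], sizes[1:]):
    ae = Autoencoder(m, n)
    opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
    for _ in range(100):                  # train this stage on current inputs
        opt.zero_grad()
        h, x_hat = ae(data)
        loss = torch.nn.functional.mse_loss(x_hat, data)
        loss.backward()
        opt.step()
    stages.append(ae)
    with torch.no_grad():                 # hidden output feeds the next AE
        data = ae(data)[0]
```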

2) Autoencoder-based scene classification methods: The autoencoder is able to automatically learn mid-level visual representations from unlabeled data, and mid-level features played an important role in remote sensing image scene classification before deep learning took off in the remote sensing community. Zhang et al. [88] introduced the sparse autoencoder to scene classification. Cheng et al. [89] used a single-hidden-layer neural network and an autoencoder to train more effective sparselets [90] for efficient scene classification and object detection. In [91], Othman et al. proposed a remote sensing image scene classification algorithm relying on convolutional features and a sparse autoencoder. Han et al. [92] provided a scene classification method based on a hierarchical convolutional sparse autoencoder. Cheng et al. [93] demonstrated that mid-level visual features learned by autoencoder-based methods are discriminative and able to facilitate scene classification tasks. In light of the limited feature representation of a single autoencoder, some researchers stacked multiple autoencoders together. Du et al. [94] came up with stacked convolutional denoising autoencoder networks; extensive experiments showed that their framework achieves superior classification performance. Yao et al. [95] integrated pairwise constraints into a stacked sparse autoencoder to learn more discriminative features for land-use scene classification and semantic annotation tasks.

Fig. 8: The architecture of CNNs.

The autoencoder and the algorithms derived from it are unsupervised learning methods and have obtained good results in scene classification of remote sensing images. However, most of the above-mentioned autoencoder-based methods cannot learn the most discriminative features for distinguishing different scene classes because they do not fully exploit scene class information.

B. CNN-Based Remote Sensing Image Scene Classification

1) Brief introduction of CNN: CNNs have shown powerful feature learning ability in the visual domain. Since Krizhevsky et al. proposed AlexNet [70] in 2012, a deep CNN that obtained the best accuracy in the LSVRC, an array of advanced CNN models has appeared, such as VGGNet [96], GoogLeNet [97], ResNet [98], DenseNet [99], SENet [100], and SKNet [101]. CNNs are a kind of multi-layer network with learning ability that consists of convolutional layers, pooling layers, and fully connected layers (see Fig. 8).

(1) Convolutional layers

Convolutional layers play an important role in extracting features from images. The input X ∈ R^{n×w×h} of a convolutional layer consists of n two-dimensional feature maps of size w × h. The output H ∈ R^{m×w′×h′} consists of m two-dimensional feature maps of size w′ × h′, obtained via the convolutional kernels W ∈ R^{m×l×l×n}, i.e., m trainable filters of size l × l × n (typically l = 1, 3, or 5). The entire process of convolution is described by equation (3), where ∗ denotes the two-dimensional convolution operation and b denotes the m-dimensional bias term. In general, a non-linear activation function f is applied after the convolution operation. As the convolutional structure deepens, the convolutional layers can capture features at different levels (e.g., edges, lines, corners, structures, and shapes) from the input feature maps.

H = f(W ∗ X + b) (3)

(2) Pooling layers

Pooling layers execute a max or average operation over a small area of each input feature map, which can be defined as equation (4), where pool represents the pooling function (e.g., average pooling, max pooling, or stochastic pooling), and H_{l−1} and H_l denote the input and output of the pooling layer, respectively. Usually, pooling layers are applied between two successive convolutional layers. The pooling operation creates invariance to small shifts and distortions, and in object detection and scene classification tasks this invariance is very important.

H_l = pool(H_{l−1}) (4)

Fig. 9: The architecture of GANs.

(3) Fully connected layers

Fully connected layers usually appear at the top of CNNs and summarize the features extracted by the bottom layers. A fully connected layer processes its input X with a linear transformation by weight W and bias b, and then maps the output of the linear transformation through a non-linear activation function f. The entire process can be formulated as equation (5). In classification tasks, to output the probability of each class, a softmax classifier is generally connected to the last fully connected layer. The softmax classifier normalizes the fully connected layer output y ∈ R^c (c is the number of classes) to values between 0 and 1, as described in equation (6), where e is the exponential function. The output of the softmax classifier denotes the probability that the given input image belongs to each class. The dropout method [59] operates on the fully connected layers to avoid overfitting, because a fully connected layer usually contains a large number of parameters.

y = f(W · X + b) (5)

P(y_i) = e^{y_i} / Σ_{j=1}^{c} e^{y_j} (6)
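Putting equations (3) to (6) together, a minimal CNN can be sketched as follows. The architecture is a toy assumption; the surveyed methods rely on much deeper networks such as VGGNet or ResNet. In practice, the softmax is usually folded into the cross-entropy loss during training.

```python
# Minimal CNN sketch tying together equations (3)-(6).
# Illustrative sizes only; real scene classifiers use far deeper networks.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=45):                   # e.g., NWPU-RESISC45
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # eq. (3): H = f(W*X + b)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # eq. (4): H_l = pool(H_{l-1})
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                              # dropout against overfitting
            nn.Linear(32 * 64 * 64, num_classes),         # eq. (5): y = f(W·X + b)
        )

    def forward(self, x):
        y = self.classifier(self.features(x))
        return torch.softmax(y, dim=1)                    # eq. (6): class probabilities

probs = TinyCNN()(torch.rand(4, 3, 256, 256))             # four 256x256 RGB patches
```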

2) CNN-based scene classification methods: In the wake of CNNs being successfully applied to large-scale visual classification tasks, around 2015, the use of CNNs finally took off in the remote sensing image analysis field [76], [77]. Compared with traditional advanced methods, e.g., SIFT [39], HOG [44], and BoVW [49], CNNs have the advantage of end-to-end feature learning, and they can extract high-level visual features that handcrafted feature-based methods cannot learn. Through different strategies for exploiting CNNs, a variety of CNN-based scene classification methods [73], [102]–[107] have emerged. Generally, the CNN-based methods of remote sensing image scene classification can be divided into three groups: using pre-trained CNNs as feature extractors, fine-tuning pre-trained CNNs on target data sets, and training CNNs from scratch.

(1) Using pre-trained CNNs as feature extractors

In the beginning, CNNs appeared as feature extractors. Penatti et al. [85] introduced CNNs into remote sensing image scene classification in 2015 and evaluated the generalization capability of off-the-shelf CNNs in the classification of remote sensing images; their experiments show that CNNs can obtain better results than low-level descriptors. Later, Hu et al. [38] treated CNNs as feature extractors and investigated how to make full use of pre-trained CNNs for scene classification. In [108], Marmanis et al. introduced a two-stage CNN scene classification framework, which used pre-trained CNNs to derive a set of representations from images; the extracted representations were then fed into shallow CNN classifiers. Chaib et al. [109] fused the deep features extracted with VGGNet to enhance scene classification performance. In [110], Li et al. fused pre-trained CNN features; the fused CNN features show better discrimination than raw CNN features in scene classification. Cheng et al. [104] designed the Bag of Convolutional Features (BoCF) for remote sensing image scene classification by using off-the-shelf CNN features to replace traditional local descriptors such as SIFT. Yuan et al. [111] rearranged the local features extracted by an already trained VGG19Net for remote sensing image scene classification. In [112], He et al. proposed a novel multilayer stacked covariance pooling algorithm (MSCP) for remote sensing image scene classification, which automatically combines multilayer feature maps extracted from pre-trained CNNs. Lu et al. [113] introduced a feature aggregation CNN (FACNN) for scene classification; FACNN learns scene representations by exploring semantic label information. These methods all used pre-trained CNNs as feature extractors and then fused or combined the extracted features. It is worth noticing that the strategy of using off-the-shelf CNNs as feature extractors is simple and effective on small-scale data sets.
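As a hedged illustration of this strategy, one can freeze a CNN pre-trained on ImageNet, strip its classification head, and feed the pooled activations to a shallow classifier. The choice of ResNet-50 via torchvision and the input size are assumptions, not the setup of any particular surveyed paper.

```python
# Using a frozen ImageNet-pretrained CNN as a feature extractor (sketch).
# Model and layer choices are illustrative assumptions.
import torch
import torchvision.models as models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()         # drop the 1000-way ImageNet head
backbone.eval()                           # no fine-tuning: weights stay frozen

with torch.no_grad():
    images = torch.rand(8, 3, 224, 224)   # stand-in for scene image patches
    feats = backbone(images)              # 8 x 2048 off-the-shelf features

# feats can now be fed to a shallow classifier (e.g., linear SVM or softmax).
```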

(2) Fine-tuning pre-trained CNNs

When the amount of training samples is not adequate to train a new CNN from scratch, fine-tuning an already trained CNN on the target data set is a good choice. Castelluccio et al. [114] delved into the use of CNNs for remote sensing image scene classification by experimenting with three learning approaches: using pre-trained CNNs as feature extractors, fine-tuning, and training from scratch. They concluded that fine-tuning gives better results than full training when the data set is small. This made researchers interested in adapting network structures or optimizing loss functions for scene classification. Cheng et al. [73] designed a novel objective function for learning discriminative CNNs (D-CNNs), which show better discriminability in scene classification. In [115], Liu et al. coupled CNNs with a hierarchical Wasserstein loss function (HW-CNNs) to improve their discriminatory ability. Minetto et al. [72] devised a new remote sensing image scene classification framework, named Hydra, which is an ensemble of CNNs and achieves the best results on the NWPU-RESISC45 data set. Wang et al. [74] introduced the attention mechanism into CNNs and designed ARCNet (attention recurrent convolutional network) for scene classification, which is capable of highlighting key areas and discarding noncritical information. In [116], to handle the problem of object scale variation in scene classification, Liu et al. formulated the multiscale CNN (MCNN). Fang et al. [117] designed a robust space-frequency joint representation (RSFJR) for scene classification by adding a frequency-domain branch to CNNs; by fusing features from the space and frequency domains, the proposed method provides more discriminative feature representations. Xie et al. [118] designed a scale-free CNN (SF-CNN) for the task of scene classification, which can accept images of arbitrary size as input without any resizing operation. Sun et al. [119] proposed a gated bidirectional network (GBN) for scene classification, which can get rid of interference information and aggregate the interdependent information among different CNN layers. In the above-mentioned methods, CNNs learn discriminative features and obtain better performance by adjusting their structures, optimizing their objective functions, or fine-tuning the modified CNNs on the target data sets.
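A minimal fine-tuning sketch, under the same assumptions as the feature-extraction example above: the pre-trained head is replaced by a new one sized for the scene classes, and the whole network is updated with a small learning rate. The class count, optimizer, and learning rate are illustrative.

```python
# Fine-tuning an ImageNet-pretrained CNN on a scene data set (sketch).
# Learning rate, class count, and optimizer are illustrative assumptions.
import torch
import torchvision.models as models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Linear(model.fc.in_features, 45)   # new 45-class head

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

model.train()
images = torch.rand(8, 3, 224, 224)          # stand-in for a labeled batch
labels = torch.randint(0, 45, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)      # softmax folded into the loss
loss.backward()
optimizer.step()                             # all layers are gently updated
```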

(3) Training CNNs from scratch

Even though fine-tuning pre-trained CNNs can achieve remarkable performance, relying on pre-trained CNNs has some limitations: the learned features are not fully suited to the characteristics of the target data sets, and it is inconvenient for researchers to modify pre-trained CNNs. In [120], Chen et al. introduced knowledge distillation into scene classification to boost the performance of light CNNs. Zhang et al. [121] presented a lightweight and effective CNN that introduces dilated convolution and channel attention into MobileNetV2 [122] for scene classification. In addition, it is of considerable interest to design more effective and robust CNNs for scene classification. He et al. [123] introduced a novel skip-connected covariance (SCCov) network for remote sensing image scene classification; SCCov adds skip connections and covariance pooling to CNNs, which reduces the number of parameters and achieves better classification performance. In [102], Zhang et al. presented a gradient boosting random convolutional network (GBRCN) for scene classification that assembles different deep neural networks.

These CNN-based methods have obtained astonishing scene classification results. However, they generally require numerous annotated samples to fine-tune already trained CNNs or train a network from scratch.

C. GAN-Based Remote Sensing Image Scene Classification

1) Brief introduction of GAN: The Generative Adversarial Network (GAN) [82] is another important and promising machine learning method. As its name implies, a GAN models the distribution of data via adversarial learning based on a minimax two-player game and generates real-like data. GANs contain a pair of components, the discriminator D and the generator G. As shown in Fig. 9, G can be thought of as a group of counterfeiters who generate fake currency, while D can be thought of as the police, who determine whether the currency was made by G or by a bank. G and D constantly pit against each other in this game until D cannot distinguish between the counterfeit currency and the genuine article. GANs use the competition between G and D as the sole training criterion. G takes an input z, a latent variable obeying a prior distribution p_z(z), and maps z into the data space using a differentiable function G(z; θ_g), where θ_g denotes the generator's parameters. D outputs the probability that the input data x comes from the real data rather than the generator, through a mapping D(x; θ_d), where θ_d denotes the discriminator's parameters. The entire process of the two-player minimax game is described by equation (7), where p_data is the distribution of the data x and V(G, D) is an objective function. From D's perspective, given input data generated by G, D will try to minimize its output, whereas if a sample comes from the real data, D will maximize its output; this is why the term log(1 − D(G(z))) appears in equation (7). Meanwhile, to fool D, G makes an effort to maximize D's output when generated data are input to D. Thus, D seeks to maximize V(G, D) while G struggles to minimize it.

min_G max_D V(G, D) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))] (7)
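The alternating optimization implied by equation (7) can be sketched as follows. This is the standard alternating-gradient recipe for the vanilla GAN, not the training procedure of any specific surveyed method; the network sizes and learning rates are toy assumptions.

```python
# Alternating training step for the minimax game in equation (7) (sketch).
# Network sizes and learning rates are toy assumptions.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 128))
D = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

x_real = torch.rand(32, 128)              # stand-in for real data samples
z = torch.randn(32, 16)                   # latent variable z ~ p_z(z)

# Discriminator step: maximize log D(x) + log(1 - D(G(z))).
opt_d.zero_grad()
loss_d = bce(D(x_real), torch.ones(32, 1)) + \
         bce(D(G(z).detach()), torch.zeros(32, 1))
loss_d.backward()
opt_d.step()

# Generator step: fool D, i.e., maximize D(G(z)).
opt_g.zero_grad()
loss_g = bce(D(G(z)), torch.ones(32, 1))
loss_g.backward()
opt_g.step()
```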

2) GAN-based scene classification methods: As a key method for unsupervised learning, GANs have, since their introduction by Goodfellow et al. [82] in 2014, gradually been applied to many tasks such as image-to-image translation, sample generation, image super-resolution, and so on. Facing the tremendous volume of remote sensing images, CNN-based methods need massive labeled samples to train models; however, annotating samples is labor-intensive, so some researchers began to employ GANs for scene classification. In 2017, Lin et al. [84] proposed multiple-layer feature-matching generative adversarial networks (MARTA GANs) for the task of scene classification. Duan et al. [83] used an adversarial net to assist in mining the inherent and discriminative features of remote sensing images; the mined features are able to enhance classification accuracy. Bashmal et al. [132] provided a GAN-based method, called Siamese-GAN, to handle aerial vehicle image classification under cross-domain conditions. In [133], to generate high-quality remote sensing images for scene classification, Xu et al. added scaled exponential linear units to GANs. Ma et al. [134] designed SiftingGAN, which can generate a large variety of authentic annotated samples for scene classification.

TABLE II: 13 publicly available data sets for remote sensing image scene classification.

1. UC Merced [49]: 100 images per class, 21 scene classes, 2100 images in total, 256×256 pixels, training ratios 50% and 80%, source: aerial orthoimagery, 2010.
2. WHU-RS19 [124]: 50∼61 images per class, 19 scene classes, 1005 images in total, 600×600 pixels, training ratios 40% and 60%, source: Google Earth, 2012.
3. RSSCN7 [125]: 400 images per class, 7 scene classes, 2800 images in total, 400×400 pixels, training ratios 20% and 50%, source: Google Earth, 2015.
4. Brazilian Coffee Scene [85]: 1438 images per class, 2 scene classes, 2876 images in total, 64×64 pixels, training ratio 50%, source: SPOT sensor, 2015.
5. SAT-4/-6 [126]: 125000/67500 images per class, 4/6 scene classes, 500000/405000 images in total, 28×28 pixels, training ratio 80%, source: National Agriculture Imagery Program, 2015.
6. SIRI-WHU [127]: 200 images per class, 12 scene classes, 2400 images in total, 200×200 pixels, training ratio 50%, source: Google Earth, 2016.
7. RSC11 [128]: about 100 images per class, 11 scene classes, 1232 images in total, 512×512 pixels, training ratio 50%, source: Google Earth, 2016.
8. AID [79]: 220∼420 images per class, 30 scene classes, 10000 images in total, 600×600 pixels, training ratios 20% and 50%, source: Google Earth, 2017.
9. NWPU-RESISC45 [80]: 700 images per class, 45 scene classes, 31500 images in total, 256×256 pixels, training ratios 10% and 20%, source: Google Earth, 2017.
10. RSI-CB128/-CB256 [129]: about 800/690 images per class, 45/35 scene classes, 36000/24000 images in total, 128×128/256×256 pixels, training ratios 50% and 80%, source: Google Earth & Bing Maps, 2017.
11. OPTIMAL-31 [74]: 60 images per class, 31 scene classes, 1860 images in total, 256×256 pixels, training ratio 80%, source: Google Earth, 2018.
12. EuroSAT [130]: 2000∼3000 images per class, 10 scene classes, 27000 images in total, 64×64 pixels, training ratio 80%, source: Sentinel-2, 2019.
13. BigEarthNet [131]: 328∼217119 images per class, 44 scene classes, 590326 images in total, 120×120 pixels, training ratio 60%, source: Sentinel-2, 2019.

Teng et al. [135] presented a classifier-constrained adversarial network for cross-domain semi-supervised scene classification. Han et al. [136] introduced a generative framework, named SSGF, for scene classification. Yu et al. [137] devised an attention GAN for scene classification, which achieves better scene classification performance by enhancing the representation power of the discriminator.

In the area of remote sensing image scene classification, most GAN-based methods use GANs for sample generation or for feature learning in an adversarial manner. Compared with CNN-based scene classification methods, only a small number of works on GAN-based scene classification have been reported so far, and the performance of GAN-based scene classification is inferior to that of CNN-based methods. In addition, most GAN-based scene classification methods cannot be trained end-to-end because they often require labels for training an additional classifier. However, the powerful self-supervised feature learning capacity of GANs provides a promising future direction for scene classification.

IV. SURVEY ON REMOTE SENSING IMAGE SCENE CLASSIFICATION BENCHMARKS

Data sets play an irreplaceable role in the advance of scene classification, and they are crucial for developing and evaluating various scene classification methods. As the number of high-resolution remote sensing sensors increases, access to massive high-resolution remote sensing images makes it possible to build large-scale scene classification benchmarks. In the past few years, researchers from different groups have proposed several publicly available high-resolution benchmark data sets for scene classification of remote sensing images [49], [74], [79], [80], [85], [124]–[131] to push this field forward. Starting with the UC-Merced data set [49], representative data sets include WHU-RS19 [124], SAT-4&6 [126], RSSCN7 [125], Brazilian Coffee Scene [85], RSC11 [128], SIRI-WHU [127], RSI-CB [129], AID [79], NWPU-RESISC45 [80], OPTIMAL-31 [74], EuroSAT [130], and BigEarthNet [131]. The characteristics of these 13 data sets are listed in Table II. Among them, the UC-Merced data set [49], the AID data set [79], and the NWPU-RESISC45 data set [80] are the three commonly-used benchmarks, which are introduced in detail below.

A. UC-Merced Data Set

The UC-Merced data set [49] (http://weegee.vision.ucmerced.edu/datasets/form.html) was released in 2010 and contains 21 scene classes, each consisting of 100 land-use images. In total, the data set comprises 2100 scene images with a pixel resolution of 0.3 m. These images were obtained from the United States Geological Survey National Map of 21 U.S. regions and fixed at 256 × 256 pixels. Fig. 10 lists samples of each category from the data set. The data set continues to be broadly employed for scene classification. When conducting algorithm evaluation, two widely-used training ratios are 50% and 80%, with the remaining 50% and 20% used for testing.

Fig. 10: Some example images from the UC-Merced data set.
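For reproducible evaluation under such fixed training ratios, the split is typically stratified per class so that every category keeps the same train/test proportion. A small sketch, assuming a per-class folder layout and scikit-learn as the tooling (both are assumptions, not part of the data set releases):

```python
# Stratified 80%/20% train/test split for a scene data set (sketch).
# The folder layout and the use of scikit-learn are assumptions.
from pathlib import Path
from sklearn.model_selection import train_test_split

root = Path("UCMerced_LandUse/Images")        # hypothetical data set root
paths = sorted(root.glob("*/*.tif"))          # one sub-folder per class
labels = [p.parent.name for p in paths]

train_paths, test_paths, train_y, test_y = train_test_split(
    paths, labels, train_size=0.8, stratify=labels, random_state=0
)  # stratify keeps the per-class ratio identical in both splits
```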

B. AID Data Set

The AID data set [79] (www.lmars.whu.edu.cn/xia/AID-project.html) is a relatively large-scale data set for aerial scene classification. It was published in 2017 by Wuhan University and consists of 30 scene classes. Each scene class contains 220 to 420 images, which were cropped from Google Earth imagery and fixed at 600 × 600 pixels. In total, the data set comprises 10000 scene images. Fig. 11 lists samples of each category from the data set. Different from the UC-Merced data set, the AID data set is multi-sourced because its aerial images were captured by different sensors. Moreover, the data set is multi-resolution: the pixel resolution of the scene categories varies from about 8 m to about 0.5 m. When conducting algorithm evaluation, two widely-used training ratios are 20% and 50%, with the remaining 80% and 50% used for testing.

C. NWPU-RESISC45 Data Set

To the best of our knowledge, the NWPU-RESISC45 data set³ [80], released by Northwestern Polytechnical University, is currently the largest scene classification data set. It consists of 45 scene categories. Each category consists of 700 images, which were obtained from Google Earth and fixed at 256 × 256 pixels. In total, the data set comprises 31500 scene images, which were chosen from more than 100 countries and regions. Apart from some specific classes with lower spatial resolution (e.g., island, lake, mountain, and iceberg), the pixel resolution of most of the scene categories varies from about 30 m to 0.2 m.

³http://www.escience.cn/people/gongcheng/NWPU-RESISC45.html

Fig. 12 shows sample images of each category from the data set. The release of the NWPU-RESISC45 data set has allowed deep learning models to develop their full potential. When conducting algorithm evaluation, two widely-used training ratios are 10% and 20%, and the remaining 90% and 80% are used for testing.
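To make the split protocols above concrete, the following is a minimal sketch of the class-wise random split used on all three benchmarks, assuming an ImageFolder-style directory layout (one sub-directory per scene class); the `data_root` path, function name, and seed are our illustrative choices, not part of any official release.

```python
import random
from collections import defaultdict
from pathlib import Path

def stratified_split(data_root, train_ratio, seed=0):
    """Randomly split an ImageFolder-style data set class by class, so
    that `train_ratio` of each category goes to the training set and the
    rest to the test set (the protocol used on UC-Merced, AID, and
    NWPU-RESISC45)."""
    rng = random.Random(seed)
    per_class = defaultdict(list)
    for path in Path(data_root).glob("*/*"):
        per_class[path.parent.name].append(path)

    train, test = [], []
    for label, paths in per_class.items():
        rng.shuffle(paths)
        k = int(round(train_ratio * len(paths)))
        train += [(p, label) for p in paths[:k]]
        test += [(p, label) for p in paths[k:]]
    return train, test

# e.g., the 20%/80% protocol commonly used on NWPU-RESISC45:
# train, test = stratified_split("NWPU-RESISC45", train_ratio=0.20)
```

Results are usually reported as the mean and standard deviation of OA over several such random splits, which is why the accuracies in Tables III–V carry ± terms.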

V. PERFORMANCE COMPARISON AND DISCUSSION

A. Evaluation Criteria

There exist three commonly-used criteria for evaluating the performance of remote sensing image scene classification: overall accuracy (OA), average accuracy (AA), and the confusion matrix. OA evaluates the performance of a classifier over the entire test data set and is formulated as the total number of accurately classified samples Nc divided by the total number of tested samples Nt, as described in equation (8). OA is a commonly-used criterion for evaluating the performance of methods for scene classification of remote sensing images. AA is defined as the sum of the per-category accuracies Ai divided by the total number of classes c, as described in equation (9).


Fig. 11: Some example images from the AID data set.

When the sample number of each category in the test set is equal, OA and AA have the same value. The confusion matrix is a detailed table that summarizes the classification results of each single category: each element xij gives the proportion of images that are predicted as the i-th category while actually belonging to the j-th category. Therefore, the confusion matrix can directly visualize the performance on each category, and from it we can easily see which classes are recognized correctly and what types of errors are made. In this survey we only use OA as the evaluation criterion, because confusion matrices would take up a lot of space.

OA = Nc / Nt    (8)

AA = (1/c) · Σ_{i=1}^{c} Ai    (9)
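As a worked illustration of equations (8) and (9), the sketch below computes OA, AA, and the confusion matrix from arrays of predicted and ground-truth labels; it uses only NumPy, and the variable names (`y_true`, `y_pred`) are ours.

```python
import numpy as np

def evaluate(y_true, y_pred, num_classes):
    """Compute OA, AA, and the confusion matrix for integer labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Confusion matrix: cm[i, j] counts samples of true class j that
    # were predicted as class i (row = prediction, column = ground truth).
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_pred, y_true), 1)

    oa = cm.trace() / cm.sum()                      # OA = Nc / Nt
    per_class_acc = cm.diagonal() / cm.sum(axis=0)  # Ai for each class
    aa = per_class_acc.mean()                       # AA = (1/c) * sum(Ai)
    return oa, aa, cm

oa, aa, cm = evaluate([0, 1, 1, 2], [0, 1, 2, 2], num_classes=3)
print(f"OA = {oa:.3f}, AA = {aa:.3f}")  # OA = 0.750, AA = 0.833
```

The small example also shows why OA and AA differ when class sizes are unequal: class 1 contributes two test samples but only one correct prediction, dragging its Ai down to 0.5 while OA averages over all samples.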

B. Performance Comparison

In recent years, a variety of scene classification algorithms have been published. Here, 27 deep learning-based scene classification methods are selected for performance comparison on three widely-used benchmark data sets. Among the 27 deep learning methods, 3 are autoencoder-based, 22 are CNN-based, and 2 are GAN-based.


Fig. 12: Some example images from the NWPU-RESISC45 data set.

Tables III, IV, and V report the classification accuracy comparison of deep learning-based scene classification methods on the UC-Merced data set, the AID data set, and the NWPU-RESISC45 data set, respectively, measured in terms of OA.

C. Discussion

As can be seen from Tables III, IV, and V, the performance of remote sensing image scene classification has been successively advanced. In the early days, deep learning-based scene classification approaches were mainly based on autoencoders, and researchers usually used the UC-Merced data set to evaluate


TABLE III: Overall accuracy (%) comparison of 21 scene classification methods on the UC-Merced data set.

Method | Year | Publication | 50% training | 80% training

Autoencoder-based:
SGUFL [88] | 2014 | IEEE TGRS | - | 82.72±1.18
Partlets-based method [37] | 2015 | IEEE TGRS | 88.76±0.79 | -
SCDAE [94] | 2016 | IEEE TCYB | - | 93.7±1.3

CNN-based:
GBRCN [102] | 2015 | IEEE TGRS | - | 94.53
LPCNN [103] | 2016 | JARS | - | 89.90
Fusion by Addition [109] | 2017 | IEEE TGRS | - | 97.42±1.79
ARCNet-VGG16 [74] | 2018 | IEEE TGRS | 96.81±0.14 | 99.12±0.40
MSCP [112] | 2018 | IEEE TGRS | - | 98.36±0.58
D-CNNs [73] | 2018 | IEEE TGRS | - | 98.93±0.10
MCNN [116] | 2018 | IEEE TGRS | - | 96.66±0.9
ADSSM [138] | 2018 | IEEE TGRS | - | 99.76±0.24
FACNN [113] | 2019 | IEEE TGRS | - | 98.81±0.24
SF-CNN [118] | 2019 | IEEE TGRS | - | 99.05±0.27
SCCov [123] | 2019 | IEEE TNNLS | - | 99.05±0.25
RSFJR [117] | 2019 | IEEE TGRS | 97.21±0.65 | -
GBN [119] | 2019 | IEEE TGRS | 97.05±0.19 | 98.57±0.48
ADFF [139] | 2019 | Remote Sensing | 96.05±0.56 | 97.53±0.63
CNN-CapsNet [140] | 2019 | Remote Sensing | 97.59±0.16 | 99.05±0.24
Siamese ResNet50 [141] | 2019 | IEEE GRSL | 90.95 | 94.29

GAN-based:
MARTA GANs [84] | 2017 | IEEE GRSL | 85.5±0.69 | 94.86±0.80
Attention GANs [137] | 2019 | IEEE TGRS | 89.06±0.50 | 97.69±0.69

TABLE IV: Overall accuracy (%) comparison of 16 scene classification methods on the AID data set.

Method | Year | Publication | 20% training | 50% training

CNN-based:
Fusion by Addition [109] | 2017 | IEEE TGRS | - | 91.87±0.36
ARCNet-VGG16 [74] | 2018 | IEEE TGRS | 88.75±0.40 | 93.10±0.55
MSCP [112] | 2018 | IEEE TGRS | 91.52±0.21 | 94.42±0.17
D-CNNs [73] | 2018 | IEEE TGRS | 90.82±0.16 | 96.89±0.10
MCNN [116] | 2018 | IEEE TGRS | - | 91.80±0.22
HW-CNNs [115] | 2018 | IEEE TGRS | - | 96.98±0.33
FACNN [113] | 2019 | IEEE TGRS | - | 95.45±0.11
SF-CNN [118] | 2019 | IEEE TGRS | 93.60±0.12 | 96.66±0.11
SCCov [123] | 2019 | IEEE TNNLS | 93.12±0.25 | 96.10±0.16
CNNs-WD [142] | 2019 | IEEE GRSL | - | 97.24±0.32
RSFJR [117] | 2019 | IEEE TGRS | - | 96.81±1.36
GBN [119] | 2019 | IEEE TGRS | 92.20±0.23 | 95.48±0.12
ADFF [139] | 2019 | Remote Sensing | 93.68±0.29 | 94.75±0.25
CNN-CapsNet [140] | 2019 | Remote Sensing | 93.79±0.13 | 96.32±0.12

GAN-based:
MARTA GANs [84] | 2017 | IEEE GRSL | 75.39±0.49 | 81.57±0.33
Attention GANs [137] | 2019 | IEEE TGRS | 78.95±0.23 | 84.52±0.18

autoencoder-based algorithms. As an early unsupervised deep learning method, the autoencoder has a relatively simple structure, so its feature learning capability is also limited. The accuracies of autoencoder-based approaches quickly plateaued on the standard benchmarks.

Fortunately, after 2012, CNNs, a powerful supervised learning method, proved to be capable of learning abstract features from raw images. Despite their powerful potential, it took some time, until about 2015, for CNNs to take off in the remote sensing image scene classification domain. A short while


TABLE V: Overall accuracy (%) comparison of 15 scene classification methods on the NWPU-RESISC45 data set.

Method | Year | Publication | 10% training | 20% training

CNN-based:
BoCF [104] | 2017 | IEEE GRSL | 82.65±0.31 | 84.32±0.17
MSCP [112] | 2018 | IEEE TGRS | 88.07±0.18 | 90.81±0.13
D-CNNs [73] | 2018 | IEEE TGRS | 89.22±0.50 | 91.89±0.22
HW-CNNs [115] | 2018 | IEEE TGRS | - | 94.38±0.17
IORN [143] | 2018 | IEEE GRSL | 87.83±0.16 | 91.30±0.17
ADSSM [138] | 2018 | IEEE TGRS | 91.69±0.22 | 94.29±0.14
SF-CNN [118] | 2019 | IEEE TGRS | 89.89±0.16 | 92.55±0.14
ADFF [139] | 2019 | Remote Sensing | 90.58±0.19 | 91.91±0.23
CNN-CapsNet [140] | 2019 | Remote Sensing | 89.03±0.21 | 89.03±0.21
SCCov [123] | 2019 | IEEE TNNLS | 89.30±0.35 | 92.10±0.25
DNE [144] | 2019 | IEEE GRSL | - | 96.01
Hydra [72] | 2019 | IEEE TGRS | 92.44±0.34 | 94.51±0.21
Siamese ResNet50 [141] | 2019 | IEEE GRSL | - | 92.28

GAN-based:
MARTA GANs [84] | 2017 | IEEE GRSL | 68.63±0.22 | 75.03±0.28
Attention GANs [137] | 2019 | IEEE TGRS | 72.21±0.21 | 77.99±0.19

later, CNN-based algorithms mainly used CNNs as feature extractors, which outperformed autoencoder-based methods. However, only using CNNs as feature extractors did not exploit the full potential of CNNs. Thanks to the release of two large-scale scene classification benchmarks, namely AID and NWPU-RESISC45, in 2017, fine-tuning off-the-shelf CNNs has shown better generalization ability in the task of scene classification than only using CNNs as feature extractors.
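As a minimal sketch of this fine-tuning strategy (using torchvision's ImageNet-pretrained ResNet-50 as an example backbone; the learning rates and the 45-class head are illustrative choices of ours, not settings from any surveyed paper):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone and replace its classifier head
# with one matching the number of scene classes (45, as in NWPU-RESISC45).
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 45)

# Fine-tune the whole network, with a smaller learning rate on the
# pretrained layers than on the freshly initialized head.
backbone_params = [p for name, p in model.named_parameters()
                   if not name.startswith("fc.")]
optimizer = torch.optim.SGD(
    [{"params": backbone_params, "lr": 1e-3},
     {"params": model.fc.parameters(), "lr": 1e-2}],
    momentum=0.9, weight_decay=1e-4)

criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Using the backbone purely as a frozen feature extractor corresponds to training only `model.fc`; fine-tuning the whole stack is what lets the pretrained filters adapt to overhead imagery.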

Generally, CNN-based methods require large-scale labeled remote sensing images to train CNNs. To alleviate this issue, GANs, a novel self-supervised learning method, were introduced into remote sensing image scene classification. Through adversarial training, GANs can model the distribution of real samples and generate new samples. According to the scene classification accuracies reported in Tables III, IV, and V, the development of autoencoder-based methods has reached a bottleneck; CNN-based methods still dominate and have some upside potential; and the performance of GAN-based methods is relatively low on the three benchmarks, so there remains much room for further improving GAN-based methods.
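A minimal sketch of the GAN-based pipeline described above, in the spirit of MARTA GANs [84]: train a GAN on unlabeled images, then reuse the discriminator's intermediate activations as features for a separate classifier. The network sizes and the linear classifier here are our illustrative assumptions, not the architecture of any cited paper.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """DCGAN-style discriminator whose penultimate activations double
    as an image representation after adversarial training."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1))
        self.real_or_fake = nn.Linear(256, 1)  # adversarial head

    def forward(self, x):
        f = self.features(x).flatten(1)
        return self.real_or_fake(f), f

# After unsupervised adversarial training, the features feed a separate
# supervised classifier -- which is why such pipelines are usually not
# trained end-to-end.
disc = Discriminator()
_, feats = disc(torch.randn(8, 3, 64, 64))
classifier = nn.Linear(256, 45)     # trained on the few labeled images
logits = classifier(feats.detach())
```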

Moreover, learning discriminative feature representations is one of the critical driving forces for improving scene classification performance. Fusing multiple features [109], [117], designing effective cost functions [72], [115], modifying deep learning models [72], [118], and data augmentation [84] are all beneficial for attaining better performance. Meanwhile, with access to large-scale benchmark data sets, the gap between scene classification approaches based on supervised learning and those relying on unsupervised learning will become smaller.

The release of publicly available benchmarks, such as the UC-Merced, AID, and NWPU-RESISC45 data sets, makes it easier to compare scene classification algorithms. From the perspective of data sets, the UC-Merced data set is relatively simple, and the results on it driven by CNNs have reached saturation (above 99% classification accuracy with a training ratio of 80%). The AID data set is of moderate difficulty; the classification accuracy on it can reach about 97% using 50% training samples. For NWPU-RESISC45, some advanced CNN-based methods have reached about 96% classification accuracy when the training ratio is fixed at 20%. Up to the present, the NWPU-RESISC45 data set is still challenging compared with the UC-Merced and AID data sets.

The performance of CNN-based methods depends very much on the quantity of training data, so developing larger-scale and more challenging remote sensing image scene classification benchmarks can further promote the development of data-driven algorithms.

VI. FUTURE OPPORTUNITIES

Scene classification is an important and challenging problem for remote sensing image interpretation. Driven by its wide applications, it has attracted extensive research attention. Thanks to the advancement of deep learning techniques and the establishment of large-scale data sets for scene classification, the field has seen dramatic improvement. In spite of the amazing successes obtained in the past several years, there still exists a giant gap between the current understanding level of machines and human-level performance. Thus, there is still much work to be done in the field of scene classification. By investigating the current scene classification algorithms and the available data sets, this paper discusses several potential future directions for scene classification in remote sensing imagery.


(1) Learning discriminative feature representations. Two key factors that influence the performance of scene classification are the intraclass diversity and interclass similarity existing in remote sensing images. To tackle these challenges, some representative methods [72], [73], [145] have been introduced over the past few years, such as multi-task learning (e.g., unifying classification and similarity/metric learning) and designing/fusing CNNs. Even though these methods are effective for learning discriminative CNN features, the challenges of high intraclass variation and small interclass separability are still not fully solved, and they seriously affect the performance of scene classification. In the future, learning more discriminative feature representations to handle these challenges needs to be addressed through various learning strategies.
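A minimal sketch of the multi-task idea above, combining cross-entropy with a metric-learning term; the contrastive formulation, margin, and weighting below are our illustrative choices, not the exact objective of [73].

```python
import torch
import torch.nn.functional as F

def discriminative_loss(features, logits, labels, margin=1.0, lam=0.1):
    """Cross-entropy plus a pairwise metric term that pulls features of
    the same scene class together and pushes different classes at least
    `margin` apart, directly targeting intraclass diversity and
    interclass similarity."""
    ce = F.cross_entropy(logits, labels)

    dist = torch.cdist(features, features)             # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-class mask
    pull = dist[same].pow(2).mean()                    # intraclass compactness
    push = F.relu(margin - dist[~same]).pow(2).mean()  # interclass separation
    return ce + lam * (pull + push)
```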

(2) Learning multi-scale features. In the task of remote sensing image scene classification, the same scene/object class can appear at different scales due to changes in imaging distance and the intrinsic size variation of scenes/objects, so how to learn multi-scale features has been a crucial and open problem. Some research [116], [123], [146]–[149] on multi-scale representations has been done over the past few decades, such as multi-scale training, multi-resolution feature fusion, and changing the receptive field. However, these existing methods for learning scale-invariant features are far from the capability of human vision and cannot easily cope with the challenge of large variance in scene/object scale. For example, building deeper CNNs in order to extract high-level features has the side effect that information about small-sized objects is easily discarded. In the future, designing more robust ways to extract multi-scale features, especially for small-sized scenes/objects, would be promising for numerous vision tasks.
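A minimal sketch of the multi-resolution feature fusion idea, pooling one feature map at several grid sizes and concatenating the results (a spatial-pyramid-style design of our own choosing, not the exact architecture of [116] or [123]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScalePooling(nn.Module):
    """Pool a convolutional feature map over a pyramid of grid sizes and
    concatenate the results, so both coarse scene layout and fine,
    small-object evidence survive in a single descriptor."""
    def __init__(self, grid_sizes=(1, 2, 4)):
        super().__init__()
        self.grid_sizes = grid_sizes

    def forward(self, fmap):                 # fmap: (B, C, H, W)
        pooled = [F.adaptive_avg_pool2d(fmap, g).flatten(1)
                  for g in self.grid_sizes]  # (B, C*g*g) per scale
        return torch.cat(pooled, dim=1)

fmap = torch.randn(2, 512, 14, 14)   # e.g., a backbone's last conv map
desc = MultiScalePooling()(fmap)     # shape: (2, 512 * (1 + 4 + 16))
```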

(3) Multi-label remote sensing image scene classification. In the past few decades, extensive efforts have been made on the task of single-label image classification. However, in the real world, it is extremely common that multiple ground objects appear in a single remote sensing image because of the bird's-eye imaging view. Therefore, single-label remote sensing image scene classification does not allow for a deep understanding of the intricate content of remote sensing images. In recent years, research has been conducted on multi-label remote sensing image scene classification [150]–[156], but it still faces many challenges that need to be further addressed, such as how to exploit the relationships between different labels, how to learn more generalized discriminative features, and how to build large-scale multi-label remote sensing image scene classification data sets.
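A minimal sketch of the multi-label variant: the softmax/cross-entropy head is replaced by independent sigmoid outputs trained with binary cross-entropy, so several scene labels can be active at once. The label count, the 0.5 threshold, and the placeholder backbone are our assumptions.

```python
import torch
import torch.nn as nn

num_labels = 17                        # e.g., land-cover labels per image
backbone = nn.Sequential(              # placeholder feature extractor
    nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(32, num_labels)

criterion = nn.BCEWithLogitsLoss()     # one binary decision per label

images = torch.randn(4, 3, 224, 224)
targets = torch.randint(0, 2, (4, num_labels)).float()  # multi-hot labels

logits = head(backbone(images))
loss = criterion(logits, targets)
predicted = (logits.sigmoid() > 0.5)   # any subset of labels may fire
```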

(4) Developing larger-scale scene classification data sets. An ideal scene classification system would be capable of accurately and efficiently recognizing all scene types in all open-world scenes. Recent scene classification methods are still trained with relatively limited data sets, so they are capable of classifying the scene categories within the training data sets but are blind, in principle, to scene classes outside them. Therefore, a compelling scene classification system should be able to accurately label a novel scene image with a semantic category. The existing data sets [49], [79], [80] contain dozens of scene classes, which are far fewer than those that humans can distinguish. Moreover, a common deep CNN has millions of parameters and tends to over-fit the tens of thousands of samples in the training set. Hence, fully training a deep classification model is almost impracticable with currently available scene classification data sets. A majority of advanced scene classification algorithms mainly rely on fine-tuning pre-trained CNNs on the target data sets or utilizing pre-trained CNNs as feature extractors. Although these transfer solutions behave fairly well on target data sets with limited classes and samples, they are not optimal compared with fully training a deep CNN model, because a model trained from scratch is able to extract more specific features that are adaptable to the target domain when the training set is large enough. Considering this, developing a new large-scale data set with considerably more scene classes is very promising.

(5) Unsupervised learning for scene classification. Currently, the most advanced scene classification algorithms generally use fully supervised models learned from data annotated with semantic categories and have achieved amazing scene classification results. However, such fully supervised learning is extremely expensive and time-consuming, because data annotation must be done manually by researchers with expert knowledge of remote sensing image understanding. When the number of scene classes is huge, data annotation may become very difficult due to the massive diversities and variations in remote sensing images. Meanwhile, labeled data is generally full of noise and errors, especially for large-scale data sets, since the diverse knowledge levels of different specialists result in different understandings of the same scene classes. Fully supervised learning can hardly work well without a large data set with clean labels. As a promising unsupervised learning method, generative adversarial networks have been used for tackling scene classification with data sets that lack annotations [83], [84], [137]. Consequently, it is valuable to explore unsupervised learning for scene classification.

(6) Compact and efficient scene classification models. During the past few years, another key factor in the outstanding progress of scene classification has been the evolution of powerful deep CNNs. In order to achieve high classification accuracy, the number of layers in CNNs has increased from several to hundreds. Most advanced CNN models have millions of parameters and require a massive labeled data set for training as well as high-performance GPUs, which severely limits the deployment of scene classification algorithms on airborne and satellite-borne embedded systems. In response, some researchers are working on designing compact and lightweight scene classification models [120], [121]. In this area, there is much work to be done.
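A minimal sketch of one route to compact models, knowledge distillation as in [120]: a small student network imitates the softened predictions of a large teacher. The temperature and weighting are illustrative hyper-parameters, not values from the cited work.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Blend the usual cross-entropy on hard labels with a KL term that
    matches the student's softened class distribution to the teacher's."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean") * temperature ** 2
    return alpha * soft + (1 - alpha) * hard
```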

(7) Scene classification with limited samples. CNNs have obtained huge successes in the field of scene classification. However, most of these models demand large-scale labeled data and numerous iterations to train their parameters. This severely limits their scalability to novel categories because of the high cost of labeling, and it fundamentally confines their applicability to rare scene categories (e.g., missile positions, military zones), which are difficult to capture. In contrast, humans are adept at distinguishing scenes with


little supervision, or none at all, as in few-shot [157] or zero-shot learning [158]. For instance, children can quickly and accurately recognize scene types from a single image on TV, in a book, or from hearing a description. The current best scene classification approaches are still far from achieving the human ability to classify scene types with a few labeled samples. Exploring few-shot/zero-shot learning approaches for scene classification [159]–[161] still needs further development.
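A minimal sketch of one popular few-shot formulation, prototypical networks (note that the surveyed relation network [157] instead learns the comparison function): class prototypes are averaged from a few labeled "support" embeddings, and queries are classified by distance to the nearest prototype. The embedding dimension and episode sizes are our assumptions.

```python
import torch

def prototypical_logits(support, support_labels, queries, num_classes):
    """support: (N, D) embeddings of the few labeled examples;
    queries: (M, D) embeddings to classify. Each class prototype is the
    mean of its support embeddings; queries score each class by negative
    squared distance to that prototype."""
    prototypes = torch.stack(
        [support[support_labels == c].mean(dim=0) for c in range(num_classes)])
    return -torch.cdist(queries, prototypes).pow(2)  # (M, num_classes)

# A 5-way 1-shot episode, with random embeddings standing in for a CNN:
support = torch.randn(5, 64)
queries = torch.randn(10, 64)
logits = prototypical_logits(support, torch.arange(5), queries, num_classes=5)
pred = logits.argmax(dim=1)
```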

(8) Cross-domain scene classification. Current research has confirmed that CNNs are powerful tools for the task of scene classification, and CNN-based methods have attained remarkable performance. However, these big achievements rest on the assumption that training and testing data obey the same distribution. What happens when the training and test sets come from different domains? Can CNN models trained on a source domain generalize well to another target domain? Generally, the performance drops significantly because there exists a big gap between the source and target domains in data distribution. In fact, such differences between source and target domains are quite common in remote sensing images because of different imaging platforms (e.g., satellites and unmanned aerial vehicles) or different imaging sensors (optical, infrared, and SAR sensors). In the past few years, some researchers have explored cross-domain scene classification to enhance the generalization of CNN models and reduce the distribution gap between the target and source domains [162]–[165]. There is much potential for improving domain adaptation-based methods for scene classification, such as mapping the feature representations from the target and source domains onto a uniform space while preserving the original data structures, designing additional adaptation layers, and optimizing the loss functions.
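A minimal sketch of one such adaptation mechanism, a gradient reversal layer (the DANN technique, one possibility among the cited approaches): a domain classifier is trained to tell source from target, while the reversed gradients push the shared feature extractor toward domain-invariant representations. The layer sizes and placeholder inputs are our assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in
    the backward pass, so the feature extractor learns features the
    domain classifier cannot separate."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

features = nn.Sequential(nn.Linear(2048, 256), nn.ReLU())  # shared extractor
label_head = nn.Linear(256, 45)   # trained on labeled source images only
domain_head = nn.Linear(256, 2)   # source vs. target discriminator

x = torch.randn(8, 2048)          # pooled CNN features (placeholder)
f = features(x)
class_logits = label_head(f)                            # supervised branch
domain_logits = domain_head(GradReverse.apply(f, 1.0))  # adversarial branch
```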

VII. CONCLUSIONS

Scene classification of remote sensing images has achieved major improvements through several decades of development. The number of papers on remote sensing image scene classification is breathtaking, especially the literature on deep learning-based methods. Taking into account the rapid rate of progress in scene classification, in this paper we first discussed the main challenges that the area of remote sensing image scene classification currently faces. Then, we surveyed three kinds of deep learning-based methods in detail and introduced the mainstream scene classification benchmarks. Next, we summarized the performance of deep learning-based methods on three widely used data sets in tabular form and provided an analysis of the results. Finally, we discussed a set of promising opportunities for further research.

REFERENCES

[1] Q. Hu, W. Wu, T. Xia, Q. Yu, P. Yang, Z. Li, and Q. Song, "Exploring the use of Google Earth imagery and object-based methods in land use/cover mapping," Remote Sensing, vol. 5, no. 11, pp. 6026–6042, 2013.
[2] L. Gomez-Chova, D. Tuia, G. Moser, and G. Camps-Valls, "Multimodal classification of remote sensing images: A review and future directions," Proceedings of the IEEE, vol. 103, no. 9, pp. 1560–1584, 2015.
[3] P. Gamba, "Human settlements: A global challenge for EO data processing and interpretation," Proceedings of the IEEE, vol. 101, no. 3, pp. 570–581, 2012.
[4] D. Li, M. Wang, Z. Dong, X. Shen, and L. Shi, "Earth observation brain (EOB): An intelligent earth observation system," Geo-spatial Information Science, vol. 20, no. 2, pp. 134–140, 2017.
[5] N. Longbotham, C. Chaapel, L. Bleiler, C. Padwick, W. J. Emery, and F. Pacifici, "Very high resolution multiangle urban classification analysis," IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 4, pp. 1155–1170, 2011.
[6] A. Tayyebi, B. C. Pijanowski, and A. H. Tayyebi, "An urban growth boundary model using neural networks, GIS and radial parameterization: An application to Tehran, Iran," Landscape and Urban Planning, vol. 100, no. 1-2, pp. 35–44, 2011.
[7] T. R. Martha, N. Kerle, C. J. van Westen, V. Jetten, and K. V. Kumar, "Segment optimization and data-driven thresholding for knowledge-based landslide detection by object-based image analysis," IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 12, pp. 4928–4943, 2011.
[8] G. Cheng, L. Guo, T. Zhao, J. Han, H. Li, and J. Fang, "Automatic landslide detection from remote-sensing imagery using a scene classification method based on BoVW and pLSA," International Journal of Remote Sensing, vol. 34, no. 1-2, pp. 45–59, 2013.
[9] Z. Y. Lv, W. Shi, X. Zhang, and J. A. Benediktsson, "Landslide inventory mapping from bitemporal high-resolution remote sensing images using change detection and multiscale segmentation," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 5, pp. 1520–1532, 2018.
[10] X. Huang, D. Wen, J. Li, and R. Qin, "Multi-level monitoring of subtle urban changes for the megacities of China using high-resolution multi-view satellite imagery," Remote Sensing of Environment, vol. 196, pp. 56–75, 2017.
[11] T. Zhang and X. Huang, "Monitoring of urban impervious surfaces using time series of high-resolution remote sensing images in rapidly urbanized areas: A case study of Shenzhen," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 8, pp. 2692–2708, 2018.
[12] F. Ghazouani, I. R. Farah, and B. Solaiman, "A multi-level semantic scene interpretation strategy for change interpretation in remote sensing imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 11, pp. 8775–8795, 2019.
[13] X. Li and G. Shao, "Object-based urban vegetation mapping with high-resolution aerial photography as a single data source," International Journal of Remote Sensing, vol. 34, no. 3, pp. 771–789, 2013.
[14] N. B. Mishra and K. A. Crews, "Mapping vegetation morphology types in a dry savanna ecosystem: integrating hierarchical object-based image analysis with random forest," International Journal of Remote Sensing, vol. 35, no. 3, pp. 1175–1198, 2014.
[15] G. Cheng, P. Zhou, and J. Han, "Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 12, pp. 7405–7415, 2016.
[16] Y. Li, Y. Zhang, X. Huang, and A. L. Yuille, "Deep networks under scene-level supervision for multi-class geospatial object detection from remote sensing images," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 146, pp. 182–196, 2018.
[17] G. Cheng, J. Han, P. Zhou, and D. Xu, "Learning rotation-invariant and Fisher discriminative convolutional neural networks for object detection," IEEE Transactions on Image Processing, vol. 28, no. 1, pp. 265–278, 2018.
[18] G. Cheng and J. Han, "A survey on object detection in optical remote sensing images," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 117, pp. 11–28, 2016.
[19] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, "Object detection in optical remote sensing images: A survey and a new benchmark," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 159, pp. 296–307, 2020.
[20] K. Li, G. Cheng, S. Bu, and X. You, "Rotation-insensitive and context-augmented object detection in remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 4, pp. 2337–2348, 2017.
[21] G. Cheng, P. Zhou, and J. Han, "RIFD-CNN: Rotation-invariant and Fisher discriminative convolutional neural networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2884–2893, 2016.
[22] G. Cheng, J. Han, L. Guo, and T. Liu, "Learning coarse-to-fine sparselets for efficient object detection and scene classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1173–1181, 2015.

[23] M. Ji and J. R. Jensen, "Effectiveness of subpixel analysis in detecting and quantifying urban imperviousness from Landsat Thematic Mapper imagery," Geocarto International, vol. 14, no. 4, pp. 33–41, 1999.
[24] D. Tuia, F. Ratle, F. Pacifici, M. F. Kanevski, and W. J. Emery, "Active learning methods for remote sensing image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 7, pp. 2218–2232, 2009.
[25] D. Tuia, M. Volpi, L. Copa, M. Kanevski, and J. Munoz-Mari, "A survey of active learning algorithms for supervised remote sensing image classification," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 3, pp. 606–617, 2011.
[26] L. L. Janssen and H. Middelkoop, "Knowledge-based crop classification of a Landsat Thematic Mapper image," International Journal of Remote Sensing, vol. 13, no. 15, pp. 2827–2837, 1992.
[27] P. Ghamisi, J. Plaza, Y. Chen, J. Li, and A. J. Plaza, "Advanced spectral classifiers for hyperspectral images: A review," IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 1, pp. 8–32, 2017.
[28] L. He, J. Li, C. Liu, and S. Li, "Recent advances on spectral–spatial hyperspectral image classification: An overview and new guidelines," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 3, pp. 1579–1597, 2017.
[29] S. Li, W. Song, L. Fang, Y. Chen, P. Ghamisi, and J. A. Benediktsson, "Deep learning for hyperspectral image classification: An overview," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 9, pp. 6690–6709, 2019.
[30] G. Cheng, Z. Li, J. Han, X. Yao, and L. Guo, "Exploring hierarchical convolutional features for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 11, pp. 6712–6722, 2018.
[31] P. Zhou, J. Han, G. Cheng, and B. Zhang, "Learning compact and discriminative stacked autoencoder for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 7, pp. 4823–4833, 2019.
[32] T. Blaschke and J. Strobl, "What's wrong with pixels? Some recent developments interfacing remote sensing and GIS," Zeitschrift fur Geoinformationssysteme, pp. 12–17, 2001.
[33] T. Blaschke, "Object-based contextual image classification built on image segmentation," in Proceedings of the IEEE Workshop on Advances in Techniques for Analysis of Remotely Sensed Data, pp. 113–119, 2003.
[34] G. Yan, J.-F. Mas, B. Maathuis, Z. Xiangmin, and P. Van Dijk, "Comparison of pixel-based and object-oriented image classification approaches: a case study in a coal fire area, Wuda, Inner Mongolia, China," International Journal of Remote Sensing, vol. 27, no. 18, pp. 4039–4055, 2006.
[35] T. Blaschke, "Object based image analysis for remote sensing," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 65, no. 1, pp. 2–16, 2010.
[36] T. Blaschke, S. Lang, and G. Hay, Object-Based Image Analysis: Spatial Concepts for Knowledge-Driven Remote Sensing Applications. Springer Science & Business Media, 2008.
[37] G. Cheng, J. Han, L. Guo, Z. Liu, S. Bu, and J. Ren, "Effective and efficient midlevel visual elements-oriented land-use classification using VHR remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 8, pp. 4238–4249, 2015.
[38] F. Hu, G.-S. Xia, J. Hu, and L. Zhang, "Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery," Remote Sensing, vol. 7, no. 11, pp. 14680–14707, 2015.
[39] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[40] R. M. Haralick, K. Shanmugam, and I. H. Dinstein, "Textural features for image classification," IEEE Transactions on Systems, Man, and Cybernetics, no. 6, pp. 610–621, 1973.
[41] A. K. Jain, N. K. Ratha, and S. Lakshmanan, "Object detection using Gabor filters," Pattern Recognition, vol. 30, no. 2, pp. 295–309, 1997.
[42] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
[43] M. J. Swain and D. H. Ballard, "Color indexing," International Journal of Computer Vision, vol. 7, no. 1, pp. 11–32, 1991.
[44] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 886–893, 2005.
[45] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
[46] F. Perronnin, J. Sanchez, and T. Mensink, "Improving the Fisher kernel for large-scale image classification," in Proceedings of the European Conference on Computer Vision, pp. 143–156, 2010.
[47] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid, "Aggregating local image descriptors into compact codes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 9, pp. 1704–1716, 2011.
[48] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2169–2178, 2006.
[49] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for land-use classification," in Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 270–279, 2010.
[50] Y. Yang and S. Newsam, "Spatial pyramid co-occurrence for image classification," in Proceedings of the International Conference on Computer Vision, pp. 1465–1472, 2011.
[51] W. Shao, W. Yang, G.-S. Xia, and G. Liu, "A hierarchical scheme of multiple feature fusion for high-resolution satellite scene categorization," in Proceedings of the International Conference on Computer Vision Systems, pp. 324–333, 2013.
[52] R. Negrel, D. Picard, and P.-H. Gosselin, "Evaluation of second-order visual features for land-use classification," in Proceedings of the International Workshop on Content-Based Multimedia Indexing, pp. 1–5, 2014.
[53] L.-J. Zhao, P. Tang, and L.-Z. Huo, "Land-use scene classification using a concentric circle-structured multiscale bag-of-visual-words model," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 12, pp. 4620–4631, 2014.
[54] Y. Zhang, X. Sun, H. Wang, and K. Fu, "High-resolution remote-sensing image classification via an approximate earth mover's distance-based bag-of-features model," IEEE Geoscience and Remote Sensing Letters, vol. 10, no. 5, pp. 1055–1059, 2013.
[55] Q. Zhu, Y. Zhong, B. Zhao, G.-S. Xia, and L. Zhang, "Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery," IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 6, pp. 747–751, 2016.
[56] S. Wold, K. Esbensen, and P. Geladi, "Principal component analysis," Chemometrics and Intelligent Laboratory Systems, vol. 2, no. 1-3, pp. 37–52, 1987.
[57] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?," Vision Research, vol. 37, no. 23, pp. 3311–3325, 1997.
[58] A. M. Cheriyadat, "Unsupervised feature learning for aerial scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 1, pp. 439–451, 2013.
[59] M. L. Mekhalfi, F. Melgani, Y. Bazi, and N. Alajlan, "Land-use classification with compressive sensing multifeature fusion," IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 10, pp. 2155–2159, 2015.
[60] V. Risojevic and Z. Babic, "Unsupervised quaternion feature learning for remote sensing image classification," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 4, pp. 1521–1531, 2016.
[61] G. Sheng, W. Yang, T. Xu, and H. Sun, "High-resolution satellite scene classification using a sparse coding based multiple feature combination," International Journal of Remote Sensing, vol. 33, no. 8, pp. 2395–2412, 2012.
[62] X. Zheng, X. Sun, K. Fu, and H. Wang, "Automatic annotation of satellite images via multifeature joint sparse coding with spatial relation constraint," IEEE Geoscience and Remote Sensing Letters, vol. 10, no. 4, pp. 652–656, 2012.
[63] Y. Zhong, Q. Zhu, and L. Zhang, "Scene classification based on the multifeature fusion probabilistic topic model for high spatial resolution remote sensing imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 11, pp. 6207–6222, 2015.
[64] X. Lu, X. Zheng, and Y. Yuan, "Remote sensing scene classification by unsupervised representation learning," IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 9, pp. 5148–5157, 2017.


[65] J. Fan, T. Chen, and S. Lu, "Unsupervised feature learning for land-use scene recognition," IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 4, pp. 2250–2261, 2017.
[66] A. Romero, C. Gatta, and G. Camps-Valls, "Unsupervised deep feature extraction for remote sensing image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 3, pp. 1349–1362, 2015.
[67] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[68] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[69] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
[70] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[71] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009.
[72] R. Minetto, M. P. Segundo, and S. Sarkar, "Hydra: An ensemble of convolutional neural networks for geospatial land classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 9, pp. 6530–6541, 2019.
[73] G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han, "When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 5, pp. 2811–2821, 2018.
[74] Q. Wang, S. Liu, J. Chanussot, and X. Li, "Scene classification with recurrent attention of VHR remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 2, pp. 1155–1167, 2018.
[75] M. Li, S. Zang, B. Zhang, S. Li, and C. Wu, "A review of remote sensing image classification techniques: The role of spatio-contextual information," European Journal of Remote Sensing, vol. 47, no. 1, pp. 389–411, 2014.
[76] L. Zhang, L. Zhang, and B. Du, "Deep learning for remote sensing data: A technical tutorial on the state of the art," IEEE Geoscience and Remote Sensing Magazine, vol. 4, no. 2, pp. 22–40, 2016.
[77] X. X. Zhu, D. Tuia, L. Mou, G.-S. Xia, L. Zhang, F. Xu, and F. Fraundorfer, "Deep learning in remote sensing: A comprehensive review and list of resources," IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 4, pp. 8–36, 2017.
[78] U. Maulik and D. Chakraborty, "Remote sensing image classification: A survey of support-vector-machine-based advanced techniques," IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 1, pp. 33–52, 2017.
[79] G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu, "AID: A benchmark data set for performance evaluation of aerial scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 7, pp. 3965–3981, 2017.
[80] G. Cheng, J. Han, and X. Lu, "Remote sensing image scene classification: Benchmark and state of the art," Proceedings of the IEEE, vol. 105, no. 10, pp. 1865–1883, 2017.
[81] L. Ma, Y. Liu, X. Zhang, Y. Ye, G. Yin, and B. A. Johnson, "Deep learning in remote sensing applications: A meta-analysis and review," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 152, pp. 166–177, 2019.
[82] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proceedings of the Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[83] Y. Duan, X. Tao, M. Xu, C. Han, and J. Lu, "GAN-NL: Unsupervised representation learning for remote sensing image classification," in Proceedings of the IEEE Global Conference on Signal and Information Processing, pp. 375–379, 2018.
[84] D. Lin, K. Fu, Y. Wang, G. Xu, and X. Sun, "MARTA GANs: Unsupervised representation learning for remote sensing image classification," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 11, pp. 2092–2096, 2017.
[85] O. A. Penatti, K. Nogueira, and J. A. Dos Santos, "Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 44–51, 2015.
[86] K. Nogueira, O. A. Penatti, and J. A. Dos Santos, "Towards better exploiting convolutional neural networks for remote sensing scene classification," Pattern Recognition, vol. 61, pp. 539–556, 2017.
[87] B. Zhang, Z. Chen, D. Peng, J. A. Benediktsson, B. Liu, L. Zou, J. Li, and A. Plaza, "Remotely sensed big data: evolution in model development for information extraction [point of view]," Proceedings of the IEEE, vol. 107, no. 12, pp. 2294–2301, 2019.
[88] F. Zhang, B. Du, and L. Zhang, "Saliency-guided unsupervised feature learning for scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 4, pp. 2175–2184, 2014.
[89] G. Cheng, J. Han, L. Guo, and T. Liu, "Learning coarse-to-fine sparselets for efficient object detection and scene classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1173–1181, 2015.
[90] R. Girshick, H. O. Song, and T. Darrell, "Discriminatively activated sparselets," in Proceedings of the International Conference on Machine Learning, pp. 196–204, 2013.
[91] E. Othman, Y. Bazi, N. Alajlan, H. Alhichri, and F. Melgani, "Using convolutional features and a sparse autoencoder for land-use scene classification," International Journal of Remote Sensing, vol. 37, no. 10, pp. 2149–2167, 2016.
[92] X. Han, Y. Zhong, B. Zhao, and L. Zhang, "Scene classification based on a hierarchical convolutional sparse auto-encoder for high spatial resolution imagery," International Journal of Remote Sensing, vol. 38, no. 2, pp. 514–536, 2017.
[93] G. Cheng, P. Zhou, J. Han, L. Guo, and J. Han, "Auto-encoder-based shared mid-level visual dictionary learning for scene classification using very high resolution remote sensing images," IET Computer Vision, vol. 9, no. 5, pp. 639–647, 2015.
[94] B. Du, W. Xiong, J. Wu, L. Zhang, L. Zhang, and D. Tao, "Stacked convolutional denoising auto-encoders for feature representation," IEEE Transactions on Cybernetics, vol. 47, no. 4, pp. 1017–1027, 2016.
[95] X. Yao, J. Han, G. Cheng, X. Qian, and L. Guo, "Semantic annotation of high-resolution satellite images via weakly supervised learning," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 6, pp. 3660–3671, 2016.
[96] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[97] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
[98] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[99] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.
[100] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, 2018.
[101] X. Li, W. Wang, X. Hu, and J. Yang, "Selective kernel networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 510–519, 2019.
[102] F. Zhang, B. Du, and L. Zhang, "Scene classification via a gradient boosting random convolutional network framework," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 3, pp. 1793–1802, 2015.
[103] Y. Zhong, F. Fei, and L. Zhang, "Large patch convolutional neural networks for the scene classification of high spatial resolution imagery," Journal of Applied Remote Sensing, vol. 10, no. 2, p. 025006, 2016.
[104] G. Cheng, Z. Li, X. Yao, L. Guo, and Z. Wei, "Remote sensing image scene classification using bag of convolutional features," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 10, pp. 1735–1739, 2017.
[105] Y. Yu, Z. Gong, C. Wang, and P. Zhong, "An unsupervised convolutional feature fusion network for deep representation of remote sensing images," IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 1, pp. 23–27, 2017.
[106] Y. Liu, Y. Zhong, F. Fei, Q. Zhu, and Q. Qin, "Scene classification based on a deep random-scale stretched convolutional neural network," Remote Sensing, vol. 10, no. 3, p. 444, 2018.


[107] Q. Zhu, Y. Zhong, Y. Liu, L. Zhang, and D. Li, "A deep-local-global feature fusion framework for high spatial resolution imagery scene classification," Remote Sensing, vol. 10, no. 4, p. 568, 2018.
[108] D. Marmanis, M. Datcu, T. Esch, and U. Stilla, "Deep learning earth observation classification using ImageNet pretrained networks," IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 1, pp. 105–109, 2015.
[109] S. Chaib, H. Liu, Y. Gu, and H. Yao, "Deep feature fusion for VHR remote sensing scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 8, pp. 4775–4784, 2017.
[110] E. Li, J. Xia, P. Du, C. Lin, and A. Samat, "Integrating multilayer features of convolutional neural networks for remote sensing scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 10, pp. 5653–5665, 2017.
[111] Y. Yuan, J. Fang, X. Lu, and Y. Feng, "Remote sensing image scene classification using rearranged local features," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 3, pp. 1779–1792, 2018.
[112] N. He, L. Fang, S. Li, A. Plaza, and J. Plaza, "Remote sensing scene classification using multilayer stacked covariance pooling," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 12, pp. 6899–6910, 2018.
[113] X. Lu, H. Sun, and X. Zheng, "A feature aggregation convolutional neural network for remote sensing scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 10, pp. 7894–7906, 2019.
[114] M. Castelluccio, G. Poggi, C. Sansone, and L. Verdoliva, "Land use classification in remote sensing images by convolutional neural networks," arXiv preprint arXiv:1508.00092, 2015.
[115] Y. Liu, C. Y. Suen, Y. Liu, and L. Ding, "Scene classification using hierarchical Wasserstein CNN," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 5, pp. 2494–2509, 2019.
[116] Y. Liu, Y. Zhong, and Q. Qin, "Scene classification based on multiscale convolutional neural network," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 12, pp. 7109–7121, 2018.
[117] J. Fang, Y. Yuan, X. Lu, and Y. Feng, "Robust space–frequency joint representation for remote sensing image scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 10, pp. 7492–7502, 2019.
[118] J. Xie, N. He, L. Fang, and A. Plaza, "Scale-free convolutional neural network for remote sensing scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 9, pp. 6916–6928, 2019.
[119] H. Sun, S. Li, X. Zheng, and X. Lu, "Remote sensing scene classification by gated bidirectional network," IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 1, pp. 82–96, 2019.
[120] G. Chen, X. Zhang, X. Tan, Y. Cheng, F. Dai, K. Zhu, Y. Gong, and Q. Wang, "Training small networks for scene classification of remote sensing images via knowledge distillation," Remote Sensing, vol. 10, no. 5, p. 719, 2018.
[121] B. Zhang, Y. Zhang, and S. Wang, "A lightweight and discriminative model for remote sensing scene classification with multidilation pooling module," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 8, pp. 2636–2653, 2019.
[122] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
[123] N. He, L. Fang, S. Li, J. Plaza, and A. Plaza, "Skip-connected covariance network for remote sensing scene classification," IEEE Transactions on Neural Networks and Learning Systems, 2019.
[124] G.-S. Xia, W. Yang, J. Delon, Y. Gousseau, H. Sun, and H. Maître, "Structural high-resolution satellite image indexing," 2010.
[125] Q. Zou, L. Ni, T. Zhang, and Q. Wang, "Deep learning based feature selection for remote sensing scene classification," IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 11, pp. 2321–2325, 2015.
[126] S. Basu, S. Ganguly, S. Mukhopadhyay, R. DiBiano, M. Karki, and R. Nemani, "DeepSat: a learning framework for satellite imagery," in Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 1–10, 2015.
[127] B. Zhao, Y. Zhong, G.-S. Xia, and L. Zhang, "Dirichlet-derived multiple topic scene classification model for high spatial resolution remote sensing imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 4, pp. 2108–2123, 2015.
[128] L. Zhao, P. Tang, and L. Huo, "Feature significance-based multibag-of-visual-words model for remote sensing image scene classification," Journal of Applied Remote Sensing, vol. 10, no. 3, p. 035004, 2016.
[129] H. Li, C. Tao, Z. Wu, J. Chen, J. Gong, and M. Deng, "RSI-CB: A large scale remote sensing image classification benchmark via crowdsource data," arXiv preprint arXiv:1705.10450, 2017.
[130] P. Helber, B. Bischke, A. Dengel, and D. Borth, "EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019.
[131] G. Sumbul, M. Charfuelan, B. Demir, and V. Markl, "BigEarthNet: A large-scale benchmark archive for remote sensing image understanding," in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, pp. 5901–5904, 2019.
[132] L. Bashmal, Y. Bazi, H. AlHichri, M. M. AlRahhal, N. Ammour, and N. Alajlan, "Siamese-GAN: Learning invariant representations for aerial vehicle image categorization," Remote Sensing, vol. 10, no. 2, p. 351, 2018.
[133] S. Xu, X. Mu, D. Chai, and X. Zhang, "Remote sensing image scene classification based on generative adversarial networks," Remote Sensing Letters, vol. 9, no. 7, pp. 617–626, 2018.
[134] D. Ma, P. Tang, and L. Zhao, "SiftingGAN: Generating and sifting labeled samples to improve the remote sensing image scene classification baseline in vitro," IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 7, pp. 1046–1050, 2019.
[135] W. Teng, N. Wang, H. Shi, Y. Liu, and J. Wang, "Classifier-constrained deep adversarial domain adaptation for cross-domain semisupervised classification in remote sensing images," IEEE Geoscience and Remote Sensing Letters, 2019.
[136] W. Han, R. Feng, L. Wang, and Y. Cheng, "A semi-supervised generative framework with deep learning features for high-resolution remote sensing image scene classification," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 145, pp. 23–43, 2018.
[137] Y. Yu, X. Li, and F. Liu, "Attention GANs: Unsupervised deep feature learning for aerial scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 1, pp. 519–531, 2019.
[138] Q. Zhu, Y. Zhong, L. Zhang, and D. Li, "Adaptive deep sparse semantic modeling framework for high spatial resolution image scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 10, pp. 6180–6195, 2018.
[139] R. Zhu, L. Yan, N. Mo, and Y. Liu, "Attention-based deep feature fusion for the scene classification of high-resolution remote sensing images," Remote Sensing, vol. 11, no. 17, p. 1996, 2019.
[140] W. Zhang, P. Tang, and L. Zhao, "Remote sensing image scene classification using CNN-CapsNet," Remote Sensing, vol. 11, no. 5, p. 494, 2019.
[141] X. Liu, Y. Zhou, J. Zhao, R. Yao, B. Liu, and Y. Zheng, "Siamese convolutional neural networks for remote sensing scene classification," IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 8, pp. 1200–1204, 2019.
[142] Y. Liu, Y. Liu, and L. Ding, "Scene classification by coupling convolutional neural networks with Wasserstein distance," IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 5, pp. 722–726, 2019.
[143] J. Wang, W. Liu, L. Ma, H. Chen, and L. Chen, "IORN: An effective remote sensing image scene classification framework," IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 11, pp. 1695–1699, 2018.
[144] M. A. Dede, E. Aptoula, and Y. Genc, "Deep network ensembles for aerial scene classification," IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 5, pp. 732–735, 2019.
[145] X. Zheng, Y. Yuan, and X. Lu, "A deep scene representation for aerial scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 7, pp. 4799–4809, 2019.
[146] S. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. H. Torr, "Res2Net: A new multi-scale backbone architecture," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[147] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125, 2017.
[148] F. Yu, D. Wang, E. Shelhamer, and T. Darrell, "Deep layer aggregation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2403–2412, 2018.
[149] G. Cheng, Y. Si, H. Hong, X. Yao, and L. Guo, "Cross-scale feature fusion for object detection in optical remote sensing images," IEEE Geoscience and Remote Sensing Letters, pp. 1–5, 2020.
[150] Y. Hua, L. Mou, and X. X. Zhu, "Recurrently exploring class-wise attention in a hybrid convolutional and bidirectional LSTM network for multi-label aerial image classification," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 149, pp. 188–199, 2019.


[151] B. Chaudhuri, B. Demir, S. Chaudhuri, and L. Bruzzone, “Multilabelremote sensing image retrieval using a semisupervised graph-theoreticmethod,” IEEE Transactions on Geoscience and Remote Sensing,vol. 56, no. 2, pp. 1144–1158, 2017.

[152] R. Stivaktakis, G. Tsagkatakis, and P. Tsakalides, “Deep learning formultilabel land cover scene categorization using data augmentation,”IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 7, pp. 1031–1035, 2019.

[153] Y. Hua, L. Mou, and X. X. Zhu, “Relation network for multilabel aerialimage classification,” IEEE Transactions on Geoscience and RemoteSensing, 2020.

[154] N. Khan, U. Chaudhuri, B. Banerjee, and S. Chaudhuri, “Graph convo-lutional network for multi-label vhr remote sensing scene recognition,”Neurocomputing, vol. 357, pp. 36–46, 2019.

[155] B. T. Zegeye and B. Demir, “A novel active learning technique formulti-label remote sensing image scene classification,” in Proceed-ings of the Image and Signal Processing for Remote Sensing XXIV,vol. 10789, p. 107890B, 2018.

[156] G. Cheng, D. Gao, Y. Liu, and J. Han, “Multi-scale and discriminativepart detectors based features for multi-label image classification,”pp. 649–655, 2018.

[157] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales,“Learning to compare: Relation network for few-shot learning,” inProceedings of the IEEE Conference on Computer Vision and PatternRecognition, pp. 1199–1208, 2018.

[158] M. Ye and Y. Guo, “Zero-shot classification with discriminative seman-tic representation learning,” in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition, pp. 7140–7148, 2017.

[159] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in Proceedings of the ICML Deep Learning Workshop, vol. 2, 2015.

[160] M. Zhai, H. Liu, and F. Sun, “Lifelong learning for scene recognition in remote sensing images,” IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 9, pp. 1472–1476, 2019.

[161] A. Li, Z. Lu, L. Wang, T. Xiang, and J.-R. Wen, “Zero-shot scene classification for high spatial resolution remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 7, pp. 4157–4167, 2017.

[162] N. Ammour, L. Bashmal, Y. Bazi, M. M. Al Rahhal, and M. Zuair, “Asymmetric adaptation of deep features for cross-domain classification in remote sensing imagery,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 4, pp. 597–601, 2018.

[163] E. Othman, Y. Bazi, F. Melgani, H. Alhichri, N. Alajlan, and M. Zuair, “Domain adaptation network for cross-scene classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 8, pp. 4441–4456, 2017.

[164] X. Lu, T. Gong, and X. Zheng, “Multisource compensation network for remote sensing cross-domain scene classification,” IEEE Transactions on Geoscience and Remote Sensing, 2019.

[165] S. Song, H. Yu, Z. Miao, Q. Zhang, Y. Lin, and S. Wang, “Domain adaptation for convolutional neural networks-based remote sensing scene classification,” IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 8, pp. 1324–1328, 2019.

Gong Cheng received the B.S. degree from Xidian University, Xi’an, China, in 2007, and the M.S. and Ph.D. degrees from Northwestern Polytechnical University, Xi’an, China, in 2010 and 2013, respectively. He is currently a Professor with Northwestern Polytechnical University, Xi’an, China. His main research interests are computer vision, pattern recognition, and remote sensing image understanding. He is an associate editor of IEEE Geoscience and Remote Sensing Magazine and a guest editor of IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.

Xingxing Xie received the B.S. degree from Inner Mongolia University, Hohhot, China, in 2015, and the M.S. degree from Northwestern Polytechnical University, Xi’an, China, in 2018. He is currently pursuing the doctoral degree at Northwestern Polytechnical University. His main research interests are computer vision and pattern recognition.

Junwei Han received his B.S., M.S., and Ph.D. degrees in pattern recognition and intelligent systems from Northwestern Polytechnical University, Xi’an, China, in 1999, 2001, and 2003, respectively, where he is currently a professor. From 2003 to 2010, he was a research fellow at Nanyang Technological University, The Chinese University of Hong Kong, Dublin City University, and the University of Dundee. His research interests include computer vision and brain-imaging analysis. He is an associate editor of IEEE Transactions on Neural Networks and Learning Systems, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Human-Machine Systems, Neurocomputing, and Machine Vision and Applications.

Lei Guo received the B.S. and M.S. degrees from Xidian University, Xi’an, China, in 1982 and 1986, respectively, and the Ph.D. degree from Northwestern Polytechnical University, Xi’an, China, in 1993. He is a Professor with the School of Automation, Northwestern Polytechnical University, Xi’an, China. His research interest focuses on image processing.

Gui-Song Xia (M’10-SM’15) received his Ph.D. degree in image processing and computer vision from CNRS LTCI, Telecom ParisTech, Paris, France, in 2011. From 2011 to 2012, he was a Post-Doctoral Researcher with the Centre de Recherche en Mathematiques de la Decision, CNRS, Paris-Dauphine University, Paris, for one and a half years. He is currently a full professor of computer vision and photogrammetry at Wuhan University. He also worked as a Visiting Scholar at DMA, Ecole Normale Superieure (ENS-Paris) for two months in 2018. His current research interests include mathematical modeling of images and videos, structure from motion, perceptual grouping, and remote sensing imaging. He serves on the Editorial Boards of the journals Pattern Recognition, Signal Processing: Image Communications, and EURASIP Journal on Image & Video Processing.