

OA-CapsNet: A One-Stage Anchor-Free Capsule Network for Geospatial Object Detection from Remote Sensing Imagery

OA-CapsNet: Un réseau capsule sans ancrage en une seule étape pour la détection d'objets géospatiaux à partir d'images de télédétection

Yongtao Yu (a), Junyong Gao (a), Chao Liu (a), Haiyan Guan (b), Dilong Li (c), Changhui Yu (a), Shenghua Jin (a), Fenfen Li (a), and Jonathan Li (d)

(a) Faculty of Computer and Software Engineering, Huaiyin Institute of Technology, Huaian, China; (b) School of Remote Sensing and Geomatics Engineering, Nanjing University of Information Science and Technology, Nanjing, China; (c) State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, China; (d) Department of Geography and Environmental Management, University of Waterloo, Waterloo, Canada

ABSTRACT

Object detection from remote sensing images serves as an important prerequisite to many applications. However, scale and orientation variations, appearance and distribution diversities, occlusion and shadow contamination, and the complex environmental scenarios of objects in remote sensing images make highly accurate recognition of geospatial objects challenging. This paper proposes a novel one-stage anchor-free capsule network (OA-CapsNet) for detecting geospatial objects from remote sensing images. By employing a capsule feature pyramid network architecture as the backbone, a pyramid of high-quality, semantically strong feature representations is generated at multiple scales for object detection. Integrated with two types of capsule feature attention modules, the feature quality is further enhanced by emphasizing channel-wise informative features and class-specific spatial features. By designing a centreness-assisted one-stage anchor-free object detection strategy, the proposed OA-CapsNet performs effectively in recognizing arbitrarily-orientated and diverse-scale geospatial objects. Quantitative evaluations on two large remote sensing datasets show that a competitive overall accuracy is achieved, with a precision, a recall, and an Fscore of 0.9625, 0.9228, and 0.9423, respectively. Comparative studies also confirm the feasibility and superiority of the proposed OA-CapsNet in geospatial object detection tasks.

RÉSUMÉ

La détection d'objets à partir d'images de télédétection constitue une condition préalable importante à de nombreuses applications. Cependant, les variations d'échelle et d'orientation, les diversités d'apparence et de distribution, les contaminations de l'occlusion et de l'ombre, et des scénarios environnementaux complexes des objets dans les images apportent de grands défis pour réaliser la reconnaissance très précise des objets géospatiaux. Cet article propose un nouveau réseau capsule sans ancrage en une seule étape (OA-CapsNet) pour détecter les objets géospatiaux à partir d'images de télédétection. En utilisant une architecture de réseau pyramidal à capsules comme colonne vertébrale, une pyramide de représentations de fonctionnalités de haute qualité et sémantiquement fortes sont générées à plusieurs échelles pour la détection d'objets. Intégrée à deux types de modules à capsules, la qualité des fonctionnalités est encore améliorée en mettant l'accent sur les fonctionnalités informatives du côté des canaux et des caractéristiques spatiales spécifiques à la classe. En concevant une stratégie de détection d'objets sans ancrage à un étage assistée par la centralité, l'OA-CapsNet proposé fonctionne efficacement dans la reconnaissance d'objets géospatiaux arbitrairement orientés et diversifiés. Les évaluations quantitatives sur deux grands ensembles de données de télédétection montrent qu'une exactitude concurrentielle globale est atteinte avec une précision, un recall et un Fscore de 0,9625, 0,9228 et 0,9423 respectivement. Des études comparatives confirment également la faisabilité et la supériorité de l'OA-CapsNet dans les tâches de détection d'objets géospatiaux.

ARTICLE HISTORY: Received 19 November 2020; Accepted 28 February 2021

CONTACT Yongtao Yu [email protected], [email protected] © CASI

CANADIAN JOURNAL OF REMOTE SENSING
https://doi.org/10.1080/07038992.2021.1898937

Introduction

With the increasing advancement of optical remote sensing sensors in flexibility, cost-efficiency, resolution, and quality, remote sensing images have become an important data source for many applications, such as land cover mapping, landmark recognition, environmental analysis, and intelligent transportation systems. As a typical research topic, geospatial object detection aims to correctly identify the objects of interest and accurately locate their positions in a remote sensing image. To date, a number of techniques with continuously improving performance have been developed for geospatial object detection tasks in the literature. However, it is still challenging to achieve highly accurate and fully automated detection of geospatial objects due to the complicated scenarios of the geospatial objects in bird-view remote sensing images, such as scale and orientation variations, texture and distribution diversities, occlusion and shadow contaminations, illumination condition changes, and complex environmental conditions. Thus, exploiting advanced techniques to further enhance the accuracy and automation level of geospatial object detection is greatly meaningful and beneficial to a wide range of applications.

The recent rapid development of deep learning techniques has attracted great attention to geospatial object detection using deep learning models (Li, Wan, et al. 2020). Specifically, the existing deep learning based geospatial object detection methods can be roughly categorized into one-stage methods and two-stage methods. The two-stage methods comprise two cascaded processing modules for, respectively, region proposal generation and object recognition. Usually, a region proposal subnetwork is designed to generate a set of dense object region proposals, which are further verified by a classification subnetwork for recognizing the objects of interest (Gong et al. 2020; Li et al. 2019; Liu et al. 2021; Yu et al. 2020; Zheng et al. 2020). In contrast, the one-stage methods accomplish feature extraction and object detection with a single network without the pre-generation of object region proposals. Hu et al. (2019) proposed a convolutional neural network (CNN), which was trained with a sample updating strategy, to detect objects in large-area remote sensing images. The updated artificial composite samples were used to fine-tune the object detector. Based on the YOLOv2 architecture, Liu et al. (2019) designed a multilayer feature concatenation and feature introducing strategy aiming to improve the adaptability of the network to multiscale objects, especially small-size objects. By extracting multiscale

features and learning visual attentions at each feature scale, Wang et al. (2019) developed a multiscale visual attention network (MS-VAN) to detect multiscale objects. The multiscale features were extracted and properly fused through a skip-connected encoder-decoder backbone. To upgrade the detection accuracy of small-size objects, Qin et al. (2021) proposed a specially optimized one-stage network (SOON), which comprised three parts: feature enhancement, multiscale detection, and feature fusion. In this network, the spatial information of small-size objects was concentrated on by incorporating a receptive field enhancement module. Similarly, a one-stage network integrated with residual blocks at multiple scales was constructed by Mandal et al. (2020) for detecting small-size vehicles. The residual blocks, alongside the enlarged output feature map, enhanced the representation robustness of the feature saliencies for small-size objects. Tang et al. (2017) proposed an orientated single-shot multi-box detector (SSD) for detecting arbitrarily-orientated vehicles. This model deployed a set of default orientated anchors with varying scales at each position of the feature map to produce detection bounding boxes. Zhang, Liu, et al. (2020) designed a depthwise-separable attention-guided network (DAGN) to detect vehicles by integrating a feature concatenation and attention block into the YOLOv3 architecture. With the combination of the multi-level feature concatenation and channel feature attention mechanisms, the feature representation capability of the network was dramatically enhanced to better serve small-size vehicles. In addition, the YOLO-fine (Pham et al. 2020) and H-YOLO (Tang et al. 2020) models were also developed based on the YOLOv3 architecture to detect geospatial objects. Specifically, the YOLO-fine model performed successfully in detecting small-size objects with both high accuracy and high speed, making it applicable for real-time applications, whereas the H-YOLO model combined a region of interest preselection network and textural properties to effectively improve the object detection accuracy.

Yao et al. (2021) proposed a multiscale CNN (MSCNN) based on an EssNet backbone and a dilated bottleneck block for extracting multiscale high-quality features. The EssNet backbone functioned to maintain the resolution of deep feature levels and improve the feature encoding of the multiscale objects. To effectively handle arbitrarily-orientated objects, Zhou, Zhang, Gao, et al. (2020a) developed a rotated feature network (RFN), which generated rotation-aware features to delineate orientated objects and rotation-invariant features to conduct object recognition.


Bao et al. (2019) designed a single-shot anchor refinement network (S2ARN) for detecting orientated objects with the assistance of orientated bounding boxes. In this model, two types of regressions were applied to, respectively, generate refined anchors and accurate bounding boxes. Differently, Shi et al. (2020) constructed a one-stage anchor-free network to detect arbitrarily-orientated vehicles. In this network, a feature pyramid fusion strategy was applied to concatenate the multi-stage features for the direct regression and identification of vehicles. Similarly, Zhang, Wang, et al. (2020) proposed an anchor-free network for detecting rotated ships. This network consisted of a feature extraction backbone integrated with a selective concatenation module, a rotation Gaussian-mask module for modeling the geometric features of ships, and a detection module for ship detection and rotated bounding box regression. Li, Pei, et al. (2020) also presented a single-stage anchor-free detector based on a multiscale dense path aggregation feature pyramid network (DPAFPN). The DPAFPN performed promisingly in comprehensively considering high-level semantic information and low-level location information and avoiding information loss during shallow feature transfer. Wu et al. (2018) learned a regularized CNN to extract multiscale and rotation-insensitive convolutional channel features. These features were finally fed into an outlier removal assisted AdaBoost classifier for object recognition. To effectively tackle small-size objects, Courtrai et al. (2020) designed a generative adversarial network (GAN). Specifically, a super-resolution technique with an object-focused strategy was applied to highlight the details of the small-size objects. Likewise, by combining super-resolution and edge enhancement techniques, Rabbi et al. (2020) proposed an edge-enhanced super-resolution GAN (EESRGAN) and used varying detector networks in an end-to-end manner for object detection. Mekhalfi et al. (2019) designed a capsule network for detecting objects in unmanned aerial vehicle (UAV) images. The capsule network comprised a conventional convolutional layer and a capsule convolutional layer for capsule feature extraction, and a fully-connected capsule layer for object recognition. In addition, the Siamese graph embedding network (SGEN) (Tian et al. 2020), feature-merged single-shot detection network (FMSSD) (Wang et al. 2020), single-shot multiscale feature fusion network (Zhuang et al. 2019), single-shot recurrent network with activated semantics (Chen et al. 2018), fully convolutional network (FCN) (Cozzolino et al. 2017), region-enhanced CNN (RECNN) (Lei et al. 2020), and

Bayesian transfer learning (Zhou, Zhang, Liu et al. 2020b) were also leveraged to detect geospatial objects.

In this paper, we develop a novel one-stage anchor-free capsule network for detecting geospatial objects from remote sensing images. With the formulation of a capsule feature pyramid network architecture as the backbone for extracting multiscale high-order capsule features, the integration of capsule-based channel and spatial feature attention modules for obtaining informative feature encodings, and the design of an effective anchor-free object detection strategy, the proposed network performs promisingly in handling geospatial objects of different scales, orientations, distributions, and surface conditions in diverse complicated scenarios. The contributions of this paper include the following: (1) two types of capsule feature attention modules are proposed to emphasize channel-wise informative features and class-specific spatial features to produce high-quality object-orientated feature representations; (2) an effective centreness-assisted one-stage anchor-free object detection strategy is designed to recognize arbitrarily-orientated and varying-scale geospatial objects.

Methodology

Capsule network

Traditional deep learning models are generally designed with scalar neurons for encoding feature saliencies and probabilities. In contrast, constructed with vectorial capsules, capsule networks use the capsule length to encode the feature probability of an entity and the instantiation parameters of a capsule to depict the inherent features of the entity (Sabour et al. 2017). An advantageous property of the capsule formulation is that the vectorial representation allows a capsule not only to detect a feature but also to learn and identify its variants. That is, a category of entity features can be encoded by using capsules rather than a single feature encoding pattern like that in the CNN. Capsule convolutions operate quite differently from traditional convolutions. Specifically, for a capsule j, the input to the capsule is a weighted aggregation over all the predictions from the capsules within the convolution kernel in the previous layer as follows:

C_j = \sum_i a_{i,j} \, U_{i,j}    (1)

where C_j is the aggregated input to capsule j; a_{i,j} is a coupling coefficient reflecting the contribution of capsule i to capsule j, which is dynamically determined


by the improved dynamic routing process (Rajasegaran et al. 2019); U_{i,j} is the prediction from capsule i to capsule j, which is computed as follows:

U_{i,j} = W_{i,j} U_i    (2)

where U_i is the output of capsule i and W_{i,j} is a transformation matrix acting as a feature mapping function.

As for the capsule length-based feature probability encoding pattern in capsule networks, the longer a capsule is, the higher its probability prediction should be. To this end, a squashing function (Sabour et al. 2017) is designed as the activation function to normalize the output of a capsule. The squashing function is formulated as follows:

U_j = \frac{\|C_j\|^2}{1 + \|C_j\|^2} \cdot \frac{C_j}{\|C_j\|}    (3)

where C_j and U_j are, respectively, the input and the output of capsule j. The operator \| \cdot \| calculates the modulus of a vector, which represents the capsule length. Through the above normalization process, long capsules are shrunk to a length close to one to cast high predictions, whereas short capsules are suppressed to almost zero length to provide few contributions.
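To make the capsule operations above concrete, the following minimal NumPy sketch (ours, not the authors' implementation) evaluates Equations (1)-(3) for a single output capsule. The coupling coefficients a_{i,j} are supplied directly rather than computed by the improved dynamic routing of Rajasegaran et al. (2019), and all shapes are purely illustrative.

```python
import numpy as np

def squash(c, eps=1e-8):
    """Squashing non-linearity of Eq. (3): long capsules -> length ~1,
    short capsules -> length ~0. `c` has shape (..., capsule_dim)."""
    norm2 = np.sum(c ** 2, axis=-1, keepdims=True)   # ||C_j||^2
    norm = np.sqrt(norm2 + eps)                      # ||C_j||
    return (norm2 / (1.0 + norm2)) * (c / norm)

def capsule_aggregate(u, w, a):
    """Eqs. (1)-(2) for a single output capsule j.

    u : (n_in, d_in)        outputs U_i of the input capsules
    w : (n_in, d_in, d_out) transformation matrices W_{i,j}
    a : (n_in,)             coupling coefficients a_{i,j} (assumed given;
                            in the paper they come from dynamic routing)
    """
    u_hat = np.einsum('id,ido->io', u, w)      # predictions U_{i,j} = W_{i,j} U_i, Eq. (2)
    c_j = np.sum(a[:, None] * u_hat, axis=0)   # weighted aggregation, Eq. (1)
    return squash(c_j[None, :])[0]             # activated output U_j, Eq. (3)

# toy usage: 9 input capsules of dimension 16 predicting one 16-D output capsule
rng = np.random.default_rng(0)
u_j = capsule_aggregate(rng.normal(size=(9, 16)),
                        rng.normal(size=(9, 16, 16)) * 0.1,
                        np.full(9, 1.0 / 9))
print(np.linalg.norm(u_j))   # capsule length, always in (0, 1)
```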

One-stage anchor-free capsule network

As shown in Figure 1, based on capsule representations, we construct a fully convolutional one-stage anchor-free capsule network (OA-CapsNet) for detecting geospatial objects. The architecture of the OA-CapsNet involves a feature extraction backbone and a set of parallel multiscale object detection heads. By employing a feature pyramid network architecture, the feature extraction backbone serves to extract high-quality, multiscale capsule features. The object detection head is designed with a one-stage anchor-free strategy for directly detecting and regressing objects, which avoids the tedious work of anchor determination and proposal generation and improves the detection efficiency.

The feature extraction backbone comprises two traditional convolutional layers for extracting low-level image features, and a pyramid of capsule convolutional layers for extracting multiscale high-level entity features. For the traditional convolutional layers, the rectified linear unit (ReLU) is adopted as the activation function. The scalar features output by the second traditional convolutional layer are transformed into vectorial capsule representations to constitute the primary capsule layer for further characterizing entity

features. This can be achieved through traditional convolution operations, followed by feature channel grouping and capsule vectoring. The capsule convolutional layers are split into four network stages by three capsule max-pooling layers to extract capsule features at different scales with a scaling step of two. Specifically, within each stage, the feature maps maintain the same spatial resolution and size. The spatial size of the feature maps is gradually scaled down stage by stage to produce lower-resolution, but semantically higher-level, feature maps. For each stage, the feature map of the top layer, which encodes the strongest and most representative feature semantics, is selected and modulated with a 1×1 capsule convolution to form a reference feature map for further feature fusion and augmentation. As shown in Figure 1, the set of reference feature maps obtained from the four stages is denoted by {G1, G2, G3, G4}. Then, through a series of operations, including capsule deconvolutions to upsample a high-level reference feature map to twice its spatial size, capsule feature concatenations to concatenate the upsampled high-level reference feature map with the low-level reference feature map from the previous stage, and capsule convolutions to conduct feature fusion, these multiscale reference feature maps are gradually fused in a top-down manner to generate a set of multiscale fused feature maps {D1, D2, D3}. For instance, feature map G4 is first upsampled to twice its spatial size to hallucinate a higher-resolution feature map through capsule deconvolution operations. Then, the upsampled feature map is concatenated with feature map G3 through the lateral connection. Afterward, a 3×3 capsule convolution is performed on the concatenated feature maps to conduct feature fusion, resulting in the multiscale fused feature map D3. This process repeats downward to gradually fuse all the feature maps {G1, G2, G3, G4}. Finally, a 3×3 capsule convolution is applied to {D1, D2, D3, G4} to further smooth the fused features and obtain the multiscale feature maps {P1, P2, P3, P4}, which have scales of {1, 1/2, 1/4, 1/8} with regard to the input image and encode strong feature semantics at each scale. In this way, the high-resolution features in the lower stages are effectively augmented by the semantically strong features in the higher stages to provide a set of high-quality feature maps at multiple scales.
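As a rough illustration of this top-down fusion (a sketch under simplifying assumptions, not the authors' implementation), the following PyTorch snippet uses ordinary 2-D convolutions and deconvolutions over flattened capsule channels in place of the paper's capsule operations, and omits the squashing and routing steps; layer widths and the toy channel count are illustrative only.

```python
import torch
import torch.nn as nn

class TopDownFusion(nn.Module):
    """Top-down fusion of the reference maps {G1..G4} into {P1..P4}.

    Plain 2-D (de)convolutions over flattened capsule channels stand in for
    the paper's capsule (de)convolutions; in the paper `ch` would be
    (number of capsule types) x (capsule dimension), e.g. 64 x 16."""
    def __init__(self, ch):
        super().__init__()
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(ch, ch, 2, stride=2) for _ in range(3)])
        self.fuse = nn.ModuleList(
            [nn.Conv2d(2 * ch, ch, 3, padding=1) for _ in range(3)])
        self.smooth = nn.ModuleList(
            [nn.Conv2d(ch, ch, 3, padding=1) for _ in range(4)])

    def forward(self, g):                     # g = [G1, G2, G3, G4], fine -> coarse
        d = g[3]                              # start from the coarsest map G4
        out = [self.smooth[3](d)]             # P4
        for i in (2, 1, 0):                   # build D3, then D2, then D1
            d = self.fuse[i](torch.cat([self.up[i](d), g[i]], dim=1))
            out.insert(0, self.smooth[i](d))  # prepend P3, P2, P1
        return out                            # [P1, P2, P3, P4]

# toy usage: four maps at scales {1, 1/2, 1/4, 1/8} of a 64x64 input, 32 channels
maps = [torch.randn(1, 32, s, s) for s in (64, 32, 16, 8)]
print([tuple(p.shape) for p in TopDownFusion(ch=32)(maps)])
```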

Figure 1. Architecture of the proposed one-stage anchor-free capsule network (OA-CapsNet). The dimension of a capsule is configured as 16.

Specifically, to enhance the feature representation capability at each scale, we design a capsule-based channel feature attention (CFA) module and integrate it at the end of each stage to boost the quality of the reference feature map. The architecture of the CFA module is inspired by the squeeze-and-excitation network (Hu et al. 2018) and leverages a capsule-based formulation. The CFA module aims to upgrade the input features by exploiting the channel-wise interdependencies, increasing the ability of the network to highlight informative feature channels associated with the foreground and to suppress the impacts of the unhelpful or less informative feature channels. To this end, as shown in Figure 2, the input multi-dimensional capsule feature map is first converted, through a 1×1 capsule convolution, into a one-dimensional capsule feature map A, which maintains the same number of feature channels and the same spatial size and mainly encodes the feature saliencies of the input feature map. Then, a channel descriptor C is obtained by performing global average pooling on feature map A in a channel-wise manner. That is, a scalar value reflecting the global feature statistics of a feature channel is generated by spatially averaging the features in that channel. Finally, two fully-connected layers are connected to exploit channel-wise interdependencies. These two fully-connected layers are, respectively, activated by the ReLU and the sigmoid function. The output of the second fully-connected layer produces a channel-wise attention descriptor R, whose length equals the number of channels of the input feature map and each of whose entries encodes the importance of the associated channel of the input feature map. That is, the larger the value of an entry, the more important the associated channel; the smaller the value of an entry, the less informative the associated channel. The attention descriptor R is used as weight factors to recalibrate the input feature map to emphasize the informative and salient features and weaken the contributions of the less important ones. This is achieved by multiplying R with the input feature map in a channel-wise and element-wise manner as follows:

\bar{U}_j^i = r_i \cdot U_j^i, \quad i = 1, 2, \ldots, 64    (4)

where r_i represents the i-th entry of the attention descriptor R; U_j^i and \bar{U}_j^i are, respectively, the output of the original capsule and the output of the recalibrated capsule in the i-th channel. In this way, the informative feature channels are significantly highlighted and the less informative feature channels are rationally suppressed, thereby boosting the feature representation robustness.
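A minimal PyTorch sketch of a CFA-style block is given below, assuming the input is arranged as 64 capsule channels of dimension 16; it is our illustrative rendering of the description above rather than the authors' code, and the bottleneck reduction ratio in the two fully-connected layers is an assumption borrowed from the squeeze-and-excitation design, since the paper does not specify their widths.

```python
import torch
import torch.nn as nn

class CFA(nn.Module):
    """Capsule channel feature attention (sketch). Input: (B, 64, 16, H, W),
    i.e. 64 capsule channels of dimension 16. A 1x1 convolution collapses the
    capsule dimension into a per-channel saliency map A, global average
    pooling gives the channel descriptor C, and two fully-connected layers
    (ReLU, then sigmoid) produce the attention descriptor R of Eq. (4)."""
    def __init__(self, channels=64, caps_dim=16, reduction=4):  # reduction is an assumption
        super().__init__()
        self.to_saliency = nn.Conv2d(channels * caps_dim, channels, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                                  # x: (B, 64, 16, H, W)
        b, c, d, h, w = x.shape
        a = self.to_saliency(x.reshape(b, c * d, h, w))    # saliency map A: (B, 64, H, W)
        r = self.fc(a.mean(dim=(2, 3)))                    # descriptor R: (B, 64), entries in (0, 1)
        return x * r.view(b, c, 1, 1, 1)                   # Eq. (4): channel-wise recalibration

# toy usage
print(CFA()(torch.randn(2, 64, 16, 32, 32)).shape)   # torch.Size([2, 64, 16, 32, 32])
```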

In addition, to further enhance the feature semantics at each scale, we design a capsule-based spatial feature attention (SFA) module and mount it over the multiscale feature maps {P1, P2, P3, P4}. The architecture of the SFA module is inspired by the dual attention network (Fu et al. 2019) and adopts a capsule-based formulation. The SFA module functions to enforce the network to focus on the spatial features associated with the object regions and adequately suppress the influences of the background features. As shown in Figure 3, first, two 1×1 capsule convolutions are, respectively, performed on the input multi-dimensional capsule feature map to convert it into two one-dimensional identical-size feature maps B ∈ R^{H×W×64} and E ∈ R^{H×W×64}, where H and W are, respectively, the height and width of the input feature map. Then, feature maps B and E are reshaped along different dimensions to obtain two feature matrices X ∈ R^{N×64} and Y ∈ R^{64×N}, where N = W×H denotes the number of positions in the input feature map. Next, we multiply X with Y (i.e., XY) and activate each element of the product matrix by a softmax function in a column-wise manner to constitute a spatial attention matrix S ∈ R^{N×N}. The element s_{i,j} at row i, column j of the spatial attention matrix S encodes the impact of position i on position j in the input feature map. The computation of the element s_{i,j} is as follows:

s_{i,j} = \frac{\exp\left(\sum_{k=1}^{64} x_{i,k} \, y_{k,j}\right)}{\sum_{m=1}^{N} \exp\left(\sum_{k=1}^{64} x_{m,k} \, y_{k,j}\right)}    (5)

where x_{i,k} denotes the element at row i, column k of the feature matrix X and y_{k,j} denotes the element at row k, column j of the feature matrix Y. Finally, the input feature map is reshaped to a capsule feature matrix T ∈ R^{N×64×16} and multiplied with the spatial attention matrix S (i.e., ST), followed by a reshaping operation to restore the product matrix to the dimension R^{H×W×64×16}, to produce a high-quality, spatially highlighted feature map. In this way, the spatial features associated with the object regions are positively concentrated on and the background features are effectively weakened, thereby improving the object-level feature characterization quality. As shown in Figure 1, the multiscale class-specific feature maps {F1, F2, F3, F4} output by the SFA modules are fed into the object detection head to conduct object inference. This set of feature maps comprehensively takes into account both the channel and spatial feature informativeness, as well as the multiscale feature semantics, to provide semantically strong feature representations at multiple scales.

Figure 2. Architecture of the channel feature attention (CFA) module.

Figure 3. Architecture of the spatial feature attention (SFA) module.
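The sketch below renders the SFA description in PyTorch, again assuming a (64-channel, 16-dimensional) capsule layout; it follows the paper's wording literally (a column-wise softmax for Equation (5) and the product ST) and is an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SFA(nn.Module):
    """Capsule spatial feature attention (sketch). Input: (B, 64, 16, H, W).
    Two 1x1 convolutions give the one-dimensional maps B and E; their
    flattened forms X (N x 64) and Y (64 x N) produce the N x N attention
    matrix S of Eq. (5), which then re-weights the reshaped input T."""
    def __init__(self, channels=64, caps_dim=16):
        super().__init__()
        self.proj_b = nn.Conv2d(channels * caps_dim, channels, 1)
        self.proj_e = nn.Conv2d(channels * caps_dim, channels, 1)

    def forward(self, x):                          # x: (B, 64, 16, H, W)
        b, c, d, h, w = x.shape
        n = h * w
        flat = x.reshape(b, c * d, h, w)
        xm = self.proj_b(flat).reshape(b, c, n).permute(0, 2, 1)   # X: (B, N, 64)
        ym = self.proj_e(flat).reshape(b, c, n)                    # Y: (B, 64, N)
        s = F.softmax(torch.bmm(xm, ym), dim=1)    # Eq. (5): softmax over i for each column j
        t = x.reshape(b, c, d, n).permute(0, 3, 1, 2).reshape(b, n, c * d)  # T: (B, N, 64*16)
        out = torch.bmm(s, t)                      # S * T, as in the paper's description
        return out.reshape(b, n, c, d).permute(0, 2, 3, 1).reshape(b, c, d, h, w)

# toy usage
print(SFA()(torch.randn(1, 64, 16, 16, 16)).shape)   # torch.Size([1, 64, 16, 16, 16])
```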

As shown in Figure 1, the object detection head employs a shallow capsule convolutional network with three parallel output branches for, respectively, identifying the presence of an object, inferring the centreness of a position, and regressing the orientated bounding box of an object. The classification branch outputs a (C+1)-dimensional softmax vector at each position for recognizing the C categories of objects and the background. That is, the outputs at each position of the classification branch are activated by the softmax function to generate a one-hot prediction. As a result, the category corresponding to the maximum softmax output is assigned as the predicted category label at a position. Specifically, to effectively handle arbitrarily-orientated objects, we use a five-tuple representation {d1, d2, d3, d4, θ} to characterize an object at a position. As shown in Figure 4a, {d1, d2, d3, d4} represent the distances from a position inside the object region to the four sides of the object's bounding box, and θ ∈ [0, π) represents the orientation of the

object, which is defined as the included angle from the positive direction of the x-axis to the direction parallel to the long side of the object's bounding box along the anticlockwise direction. Based on such a representation, the regression branch outputs a five-dimensional vector at each position for encoding the orientated bounding box of an object. Note that, since the object scales vary greatly in the multiscale feature maps, rather than directly regressing the large-range parameters {d1, d2, d3, d4}, we adopt the following transformations to restrain them to a small range:

d_i = \frac{\sqrt{W_F^2 + H_F^2}}{2} \, e^{b_i}, \quad i = 1, 2, 3, 4    (6)

where W_F and H_F are, respectively, the width and height of the feature map used for regression. As a result, the regression branch only needs to output the small-range values b_i, i = 1, 2, 3, 4 in the feature map domain. The predicted regression parameters in the image domain can be computed using Equation (6). In addition, to effectively focus on the high-quality bounding boxes near the object center and suppress the low-quality bounding boxes at positions far away from the object center, we add an indicator to depict the centreness of a position, which measures the proximity of a position to the object center. That is, the higher the centreness measure at a position, the closer the position is to the object center. Thus, the centreness branch outputs a scalar value (activated by the sigmoid function) at each position for inferring the centreness of the position, which is used to weight the corresponding objectness score of the classification branch. Concretely, the certainty of the existence of an object at a position is reflected by the product of the softmax output from the classification branch and the centreness indicator from the centreness branch.
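The following small NumPy sketch (ours, with illustrative values) decodes the raw outputs of one position of a detection head according to Equation (6) and the centreness weighting described above; placing the background class at the last index of the softmax vector is an assumption.

```python
import numpy as np

def decode_prediction(b, theta, cls_scores, centreness, feat_w, feat_h):
    """Decode one position's raw head outputs into detection quantities.

    b          : (4,) small-range regression outputs b_1..b_4
    theta      : orientation in [0, pi)
    cls_scores : (C+1,) softmax scores (background assumed at the last index)
    centreness : sigmoid output in (0, 1)
    feat_w, feat_h : width/height of the feature map used for regression
    """
    # Eq. (6): map the small-range outputs back to image-domain distances
    d = 0.5 * np.sqrt(feat_w ** 2 + feat_h ** 2) * np.exp(b)
    # objectness = class score weighted by the centreness indicator
    label = int(np.argmax(cls_scores))
    score = float(cls_scores[label] * centreness)
    return d, theta, label, score

# toy usage on one position of a 100 x 75 feature map
d, theta, label, score = decode_prediction(
    b=np.array([-2.0, -1.5, -2.0, -1.5]), theta=0.3,
    cls_scores=np.array([0.05, 0.9, 0.03, 0.01, 0.01]), centreness=0.8,
    feat_w=100, feat_h=75)
print(d, label, score)
```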

To construct a high-quality object detection model, at the training stage, the positive samples are selected as the positions located in the central area of the object's bounding box with a ratio of d = 0.8 (Figure 4b). The remaining marginal area, which contains low-quality object information, is directly ignored. The background area is used as the negative samples.

Figure 4. Illustrations of (a) the five-tuple representation of an arbitrarily-orientated object at a position, and (b) the positive and negative regions used for training.

Based on the multi-branch prediction architecture of the OA-CapsNet, the loss function used for directing the training process is designed as a multitask loss function as follows:

L = k_1 L_{cls} + k_2 L_{reg} + k_3 L_{cnt}    (7)

where L_{cls}, L_{reg}, and L_{cnt} are, respectively, the classification, regression, and centreness loss terms of the three prediction branches; k_1, k_2, and k_3 are the regularization factors balancing the contributions of the three loss terms. To effectively handle orientated bounding boxes, L_{reg} is formulated as the generalized intersection over union (GIoU) loss (Rezatofighi et al. 2019) between the regressed bounding boxes and the target bounding boxes. L_{cls} and L_{cnt} are formulated as the focal loss (Lin et al. 2017) between the predictions and the ground truths. Specifically, given the distance regression targets {d*_1, d*_2, d*_3, d*_4} for a position, the corresponding centreness target is calculated as follows:

c^* = \sqrt{\frac{\min(d_1^*, d_3^*)}{\max(d_1^*, d_3^*)} \times \frac{\min(d_2^*, d_4^*)}{\max(d_2^*, d_4^*)}}    (8)
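The centreness target of Equation (8) and the weighted combination of Equation (7) are simple enough to state directly; the sketch below (ours, not the authors' code) shows both, with the focal and GIoU loss values themselves left as scalar inputs.

```python
import numpy as np

def centreness_target(d1, d2, d3, d4):
    """Eq. (8): centreness of a position from its distance targets d*_1..d*_4.
    Equals 1 at the box centre and decays toward the box borders."""
    return np.sqrt((min(d1, d3) / max(d1, d3)) * (min(d2, d4) / max(d2, d4)))

def total_loss(l_cls, l_reg, l_cnt, k1=1.0, k2=1.0, k3=1.0):
    """Eq. (7): weighted sum of the classification (focal), regression (GIoU)
    and centreness (focal) terms; k1 = k2 = k3 = 1 as stated in the paper."""
    return k1 * l_cls + k2 * l_reg + k3 * l_cnt

# toy usage: a position near the box centre vs. one near a border
print(centreness_target(10, 12, 11, 12))   # close to 1 for a near-centre position
print(centreness_target(2, 12, 20, 12))    # small for an off-centre position
print(total_loss(0.4, 0.8, 0.2))           # 1.4
```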

Results and discussion

Datasets

We evaluated the geospatial object detection performance of the proposed OA-CapsNet on the following two large-scale remote sensing image datasets: the GOD218 (Yu et al. 2020) and DOTA (Xia et al. 2018) datasets. The GOD218 dataset contains 22,000 images covering four categories of geospatial objects, including airplane, ship, vehicle, and ground track field. This dataset involves 69,207 annotated instances labeled with orientated bounding boxes. All the images have the same image size of 800×600 pixels. The DOTA dataset comprises 2086 images covering fifteen categories of geospatial objects, including vehicle, airplane, ship, harbor, bridge, tennis court, etc. This dataset involves 188,282 annotated instances labeled with arbitrary quadrilaterals. The image sizes range from about 800×800 pixels to 4000×4000 pixels. These two datasets are remarkably challenging, since the images were captured using different sensors and platforms and the geospatial objects exhibit varying scales and orientations, diverse appearances and distributions, different levels of occlusion and shadow contamination, and complicated environmental scenarios. At the training stage, the training sets of these two datasets were applied to construct the proposed OA-CapsNet.

Network training

The proposed OA-CapsNet was trained in an end-to-end manner by backpropagation and stochastic gradient descent on a cloud computing platform with ten 16-GB GPUs, one 16-core CPU, and a memory size of 64 GB. Before training, we randomly initialized all layers of the OA-CapsNet by drawing parameters from a zero-mean Gaussian distribution with a standard deviation of 0.01. Each training batch contained two images per GPU, and the network was trained for 1000 epochs. During training, we configured the initial learning rate as 0.001 for the first 800 epochs and decreased it to 0.0001 for the remaining 200 epochs. The momentum and weight decay were configured as 0.9 and 0.0005, respectively. To trade off the computational efficiency, the feature representation capability, and the object detection accuracy, we configured the dimension of a capsule as 16 for all capsule layers and the regularization factors as k_1 = 1, k_2 = 1, and k_3 = 1.
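A minimal PyTorch sketch of these optimization settings is shown below; the model is a stand-in module rather than the OA-CapsNet, and the training loop body is omitted.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)        # stand-in module, not the OA-CapsNet
for p in model.parameters():             # zero-mean Gaussian init, std 0.01
    torch.nn.init.normal_(p, mean=0.0, std=0.01)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
# lr = 0.001 for the first 800 epochs, then 0.0001 for the remaining 200
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[800], gamma=0.1)

for epoch in range(1000):
    # ... one pass over the training batches (2 images per GPU) would go here ...
    optimizer.step()          # placeholder; in practice called per batch after backward()
    scheduler.step()
```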

Geospatial object detection

At the test stage, we applied the OA-CapsNet to the test sets of the GOD218 and DOTA datasets to evaluate its object detection performance. To provide quantitative evaluations of the object detection results, we adopted the following three commonly used evaluation metrics: precision (P), recall (R), and Fscore. Precision and recall, respectively, evaluate the performance of an object detection model in distinguishing false alarms and identifying true targets. Fscore provides an overall performance evaluation by taking into account both the precision and recall measures. These three quantitative evaluation metrics are formally defined as follows:

P = \frac{TP}{TP + FP}    (9)

R = \frac{TP}{TP + FN}    (10)

F_{score} = \frac{2 \times P \times R}{P + R}    (11)

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. The geospatial object detection results obtained by the proposed OA-CapsNet on the two test datasets are reported in Table 1 using the above three quantitative evaluation metrics. In Table 1, "Average" denotes the average performance obtained on the two test datasets, including the average precision, average recall, and average Fscore.
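For completeness, the sketch below computes the three metrics of Equations (9)-(11) from raw detection counts; the counts in the example are made up and are not taken from the paper.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F-score of Eqs. (9)-(11) from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

# toy usage with made-up counts (not the paper's data)
print(detection_metrics(tp=950, fp=31, fn=75))
```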

As reported in Table 1, the proposed OA-CapsNet achieved quite promising performance on the two test datasets. An object detection accuracy with a precision, a recall, and an Fscore of 0.9687, 0.9276, and 0.9477, respectively, was obtained on the GOD218 dataset. For the DOTA dataset, an accuracy with a precision, a recall, and an Fscore of 0.9563, 0.9180, and 0.9368, respectively, was achieved in detecting geospatial objects. Comparatively, the DOTA dataset involved more categories of geospatial objects with more varying and complicated conditions; thus, a relatively better performance was obtained on the GOD218 dataset. Specifically, for each dataset, the precision metric was better than the recall metric. This means that the proposed OA-CapsNet behaved promisingly in distinguishing the true targets from the false alarms, thereby resulting in a low false recognition rate. Overall, the object detection performance was quite competitive in processing the two remarkably challenging datasets. An average object detection accuracy with a precision, a recall, and an Fscore of 0.9625, 0.9228, and 0.9423, respectively, was achieved on the two test datasets.

The challenging scenarios of these two datasets cover the following aspects: (1) objects with varying scales and orientations, (2) objects with diverse texture properties and spatial distributions, (3) objects with different levels of occlusion, (4) objects contaminated by different levels of shadows, (5) similarities between the objects of interest and the non-targets, (6) illumination condition variations, and (7) complicated environmental conditions. These challenging scenarios impeded the highly accurate recognition of the geospatial objects and required the object detection model to be self-adaptive, robust, effective, and transferable enough to correctly identify the existence of objects, accurately locate the positions of objects, effectively distinguish the true targets from false alarms, and applicably handle different image sources. Fortunately, the proposed OA-CapsNet still performed promisingly, with a high detection accuracy, in processing geospatial objects of varying conditions in diverse scenarios. The advantageous performance of the proposed OA-CapsNet benefited from the following aspects. First, by employing a capsule feature pyramid network architecture as the backbone, the proposed OA-CapsNet can extract and fuse multiscale and multilevel high-order capsule features to provide a semantically strong feature representation at each scale. Second, by integrating the two types of capsule-based feature attention modules, the proposed OA-CapsNet can highlight channel-wise informative features and focus on class-specific spatial features to further enhance the feature representation quality and robustness. Last but not least, by designing a centreness-assisted anchor-free object detection network, the proposed OA-CapsNet can detect arbitrarily-orientated and varying-scale objects.

Table 1. Geospatial object detection results obtained by different methods.

Method             Dataset    Precision   Recall    Fscore
OA-CapsNet         GOD218     0.9687      0.9276    0.9477
                   DOTA       0.9563      0.9180    0.9368
                   Average    0.9625      0.9228    0.9423
OA-CapsNet-CFA     GOD218     0.9522      0.9213    0.9365
                   DOTA       0.9457      0.9121    0.9286
                   Average    0.9490      0.9167    0.9326
OA-CapsNet-SFA     GOD218     0.9486      0.9192    0.9337
                   DOTA       0.9391      0.9105    0.9246
                   Average    0.9439      0.9149    0.9292
OA-CapsNet-Light   GOD218     0.9259      0.8977    0.9116
                   DOTA       0.9168      0.8913    0.9039
                   Average    0.9214      0.8945    0.9078
MS-VAN             GOD218     0.9036      0.8769    0.8900
                   DOTA       0.8931      0.8724    0.8826
                   Average    0.8984      0.8747    0.8863
SOON               GOD218     0.9127      0.8857    0.8990
                   DOTA       0.9032      0.8816    0.8923
                   Average    0.9080      0.8837    0.8957
MSCNN              GOD218     0.9459      0.9173    0.9314
                   DOTA       0.9362      0.9081    0.9219
                   Average    0.9411      0.9127    0.9267
RFN                GOD218     0.9398      0.9118    0.9256
                   DOTA       0.9315      0.9044    0.9177
                   Average    0.9357      0.9081    0.9217
FMSSD              GOD218     0.9315      0.9052    0.9182
                   DOTA       0.9238      0.8963    0.9098
                   Average    0.9277      0.9008    0.9140
RECNN              GOD218     0.9223      0.8946    0.9082
                   DOTA       0.9131      0.8875    0.9001
                   Average    0.9177      0.8911    0.9042

For visual inspection, Figure 5 presents a subset of geospatial object detection results from the two test datasets. As observed from the object detection results in Figure 5, the objects of different scales, arbitrary orientations, diverse densities, and varying distributions in different environmental scenarios were effectively recognized. Specifically, for the images containing very high-density and parallel-distributed vehicles and ships, the proposed OA-CapsNet still achieved a competitive detection performance owing to the design of the five-tuple based orientated bounding box representation. In addition, for the images containing different-scale objects, especially small-size objects (e.g., ships, vehicles, airplanes, storage tanks, etc.), the proposed OA-CapsNet still performed promisingly in correctly identifying and locating these objects due to the semantically strong pyramidal feature representations used for object detection at multiple scales. Moreover, benefiting from the high-quality, informative, and object-orientated feature encodings upgraded by the capsule-based channel and spatial feature attention modules at each scale, some geospatial objects partially occluded by nearby high-rise objects or covered with dark shadows were also effectively detected with a quite low misidentification rate. However, as shown by the green boxes in Figure 6, some geospatial objects were severely occluded by nearby objects, leading to extreme incompleteness in the remote sensing images and very few feature presences in the feature maps. As a result, these objects failed to be correctly identified. In addition, some land covers exhibited quite similar geometric and textural properties to the geospatial objects. Consequently, they were falsely recognized as true targets. Moreover, some objects were covered with large-area heavy dark shadows or were of extremely small sizes, making the objects blend into the background. Unfortunately, these objects were treated as background and failed to be correctly detected. On the whole, the proposed OA-CapsNet achieved an acceptable performance in detecting geospatial objects of different conditions in diverse scenarios.

Figure 5. Illustration of a subset of geospatial object detection results from the GOD218 and DOTA datasets.

Figure 6. Illustration of some special challenging scenarios of the geospatial objects.

Ablation studies

As ablation studies, we further examined the performance improvement achieved by integrating the CFA and SFA modules into the pyramidal feature extraction backbone for enhancing the feature representation robustness and informativeness at multiple scales. Specifically, the CFA module functioned to exploit and highlight the channel-wise informative features related to the foreground and weaken the contributions of the background features. The SFA module served to concentrate on the class-specific spatial features associated with the object regions and suppress the impacts of the background areas. To this end, we constructed three modified networks on the basis of the OA-CapsNet. First, we removed all the SFA modules from the OA-CapsNet, leaving only the CFA modules, and named the resultant network OA-CapsNet-CFA. Then, we removed all the CFA modules from the OA-CapsNet, leaving only the SFA modules, and named the resultant network OA-CapsNet-SFA. Finally, we removed all the CFA and SFA modules from the OA-CapsNet, and named the resultant network OA-CapsNet-Light.

For fair comparisons, the same training sets of the GOD218 and DOTA datasets and the same training strategy were leveraged to train these three networks. Once the network parameters of these three networks were fine-tuned, we applied them to the test sets of the GOD218 and DOTA datasets to evaluate their object detection performances. The detailed object detection results measured using the precision, recall, and Fscore metrics are reported in Table 1. Obviously, without the integration of the CFA and SFA modules, the object detection accuracies of the OA-CapsNet-Light were dramatically degraded on both test datasets. The accuracy degradation with regard to the average Fscore on the two test datasets was about 0.0345. In contrast, by integrating the CFA or the SFA modules into the multiscale feature extraction backbone, the quality of the output features was significantly upgraded to positively support the objects of interest; therefore, with the high-quality feature representations fed into the object detection heads, the object detection performance was improved by the OA-CapsNet-CFA and OA-CapsNet-SFA. Comparatively, the OA-CapsNet-CFA behaved relatively better than the OA-CapsNet-SFA, with a performance improvement of about 0.0034 with regard to the average Fscore. In conclusion, both the CFA and SFA modules contributed positively and effectively to the enhancement of the feature representation quality and the improvement of the object detection performance. Therefore, with the integration of the CFA and SFA modules to highlight both the channel-wise informative features and the class-specific spatial features associated with the foreground, the OA-CapsNet showed advantageous performance in processing geospatial objects of varying conditions in diverse scenarios.

Comparative studies

To further evaluate the feasibility and performance of the proposed OA-CapsNet, we conducted a set of comparative studies with the following six recently developed one-stage object detection methods: MS-VAN (Wang et al. 2019), SOON (Qin et al. 2021), MSCNN (Yao et al. 2021), RFN (Zhou, Zhang, Gao, et al. 2020a), FMSSD (Wang et al. 2020), and RECNN (Lei et al. 2020). Specifically, the MS-VAN leveraged multiscale features and attention mechanisms to produce high-quality feature encodings used for object detection. The SOON exploited spatial properties via a receptive field enhancement module to protrude the feature semantics of small-size objects. The MSCNN designed an effective feature extraction backbone to obtain strong feature representations at multiple scales. The RFN focused on the extraction of orientation-aware feature maps to boost the recognition of arbitrarily-orientated objects. The FMSSD extracted and fused spatial contextual features in both multiple scales and the same scales to handle varying-size geospatial objects. The RECNN introduced a saliency constraint and multilayer fusion strategy to strengthen the feature saliencies of the object regions.

Figure 7. Illustration of ship detection results obtained by different models. (a) Test image, (b) the proposed OA-CapsNet, (c) MS-VAN, (d) SOON, (e) MSCNN, (f) RFN, (g) FMSSD, and (h) RECNN.

In our experiments, for fair comparisons, the same training sets of the GOD218 and the DOTA datasets were used to train these models. Once the models were constructed, we applied them to the test sets of these two datasets to evaluate their object detection performances. The quantitative evaluations of the object detection results obtained by these models are reported in Table 1. As reflected in Table 1, the MSCNN, RFN, and FMSSD showed superior performance to the RECNN, MS-VAN, and SOON. Specifically, the object detection performance of the MSCNN was higher than that of the MS-VAN by about 0.0404 with regard to the average Fscore. The advantageous performances of the MSCNN, RFN, and FMSSD benefited from the exploitation of effective mechanisms that take advantage of multiscale or multi-orientated features to improve the feature representation informativeness and robustness. Thus, they performed promisingly in correctly recognizing the geospatial objects of varying conditions in diverse scenarios. Comparatively, by designing a capsule feature pyramid network architecture as the feature extraction backbone, integrating the two types of capsule feature attention modules to provide multiscale semantically strong and informative features, and adopting the effective centreness-assisted anchor-free object detection strategy, the proposed OA-CapsNet showed competitive and superior performance over the six compared methods. For visual inspections and comparisons, Figure 7 presents some examples of ship detection results obtained by these models. As shown in Figures 7c-7h, some ships of extremely small sizes and some ships distributed closely and in parallel were not correctly recognized. In contrast, all the ships of varying conditions were successfully detected by the proposed OA-CapsNet. Thus, through contrastive analysis, we conclude that the proposed OA-CapsNet provides a feasible and effective solution to geospatial object detection tasks.

Conclusion

This paper has presented a novel one-stage anchor-free capsule network, named OA-CapsNet, for geospatial object detection from remote sensing images. Formulated with a capsule feature pyramid network architecture as the backbone, the proposed OA-CapsNet can extract and fuse multilevel and multiscale high-order capsule features to provide a high-quality, semantically strong feature encoding at each scale. Integrated with two types of capsule feature attention modules, the proposed OA-CapsNet performed effectively in highlighting the channel-wise informative features and focusing on the class-specific spatial features to further enhance the feature representation quality and robustness. Designed with a centreness-assisted anchor-free object detection strategy, the proposed OA-CapsNet served to effectively recognize arbitrarily-orientated and diverse-scale geospatial objects. The proposed OA-CapsNet has been intensively evaluated on two large remote sensing image datasets toward geospatial object detection. Quantitative evaluations showed that a competitive overall performance, with a precision, a recall, and an Fscore of 0.9625, 0.9228, and 0.9423, respectively, was achieved in handling geospatial objects of varying conditions in diverse environmental scenarios. Comparative studies with a set of recently developed deep learning methods also confirmed the applicability and effectiveness of the proposed OA-CapsNet in geospatial object detection tasks. However, due to severe occlusions, poor illumination conditions, and the extremely small sizes of some geospatial objects, it remains challenging to achieve high-performance geospatial object detection. In our future work, we will develop part-based models to improve the detection of occluded objects, design more powerful feature attention mechanisms to highlight low-contrast objects, and exploit high-resolution network architectures or super-resolution techniques to effectively handle small-size objects.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Funding

This work was supported by the [National Natural Science Foundation of China] under Grants [62076107], [51975239], [41971414], and [41671454]; by the [Six Talent Peaks Project in Jiangsu Province] under Grant [XYDXX-098]; and by the [Natural Science Foundation of Jiangsu Province] under Grant [BK20191214].

ORCID

Yongtao Yu http://orcid.org/0000-0001-7204-9346
Haiyan Guan http://orcid.org/0000-0003-3691-8721
Dilong Li http://orcid.org/0000-0002-5826-5568
Jonathan Li http://orcid.org/0000-0001-7899-0049


References

Bao, S., Zhong, X., Zhu, R., Zhang, X., Li, Z., and Li, M. 2019. "Single shot anchor refinement network for oriented object detection in optical remote sensing imagery." IEEE Access, Vol. 7: pp. 87150–87161. doi:10.1109/ACCESS.2019.2924643.

Chen, S., Zhan, R., and Zhang, J. 2018. "Geospatial object detection in remote sensing imagery based on multiscale single-shot detector with activated semantics." Remote Sensing, Vol. 10(No. 6): pp. 820. doi:10.3390/rs10060820.

Courtrai, L., Pham, M.T., and Lefèvre, S. 2020. "Small object detection in remote sensing images based on super-resolution with auxiliary generative adversarial networks." Remote Sensing, Vol. 12(No. 19): pp. 3152. doi:10.3390/rs12193152.

Cozzolino, D., Martino, G.D., Poggi, G., and Verdoliva, L. 2017. "A fully convolutional neural network for low-complexity single-stage ship detection in Sentinel-1 SAR images." Paper presented at the IEEE International Geoscience and Remote Sensing Symposium, Fort Worth, USA, July 2017.

Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., and Lu, H. 2019. "Dual attention network for scene segmentation." Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, USA, June 2019.

Gong, Y., Xiao, Z., Tan, X., Sui, H., Xu, C., Duan, H., and Li, D. 2020. "Context-aware convolutional neural network for object detection in VHR remote sensing imagery." IEEE Transactions on Geoscience and Remote Sensing, Vol. 58(No. 1): pp. 34–44. doi:10.1109/TGRS.2019.2930246.

Hu, Y., Li, X., Zhou, N., Yang, L., Peng, L., and Xiao, S. 2019. "A sample update-based convolutional neural network framework for object detection in large-area remote sensing images." IEEE Geoscience and Remote Sensing Letters, Vol. 16(No. 6): pp. 947–951. doi:10.1109/LGRS.2018.2889247.

Hu, J., Shen, L., and Sun, G. 2018. "Squeeze-and-excitation networks." Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, June 2018.

Lei, J., Luo, X., Fang, L., Wang, M., and Gu, Y. 2020. "Region-enhanced convolutional neural network for object detection in remote sensing images." IEEE Transactions on Geoscience and Remote Sensing, Vol. 58(No. 8): pp. 5693–5702. doi:10.1109/TGRS.2020.2968802.

Li, Q., Mou, L., Xu, Q., Zhang, Y., and Zhu, X.X. 2019. "R3-Net: A deep network for multioriented vehicle detection in aerial images and videos." IEEE Transactions on Geoscience and Remote Sensing, Vol. 57(No. 7): pp. 5028–5042. doi:10.1109/TGRS.2019.2895362.

Li, Y., Pei, X., Huang, Q., Jiao, L., Shang, R., and Marturi, N. 2020. "Anchor-free single stage detector in remote sensing images based on multiscale dense path aggregation feature pyramid network." IEEE Access, Vol. 8: pp. 63121–63133. doi:10.1109/ACCESS.2020.2984310.

Li, K., Wan, G., Cheng, G., Meng, L., and Han, J. 2020. "Object detection in optical remote sensing images: A survey and a new benchmark." ISPRS Journal of Photogrammetry and Remote Sensing, Vol. 159: pp. 296–307. doi:10.1016/j.isprsjprs.2019.11.023.

Lin, T.Y., Goyal, P., Girshick, R., He, K., and Dollár, P. 2017. "Focal loss for dense object detection." Paper presented at the IEEE International Conference on Computer Vision, Venice, Italy, October 2017.

Liu, W., Ma, L., Wang, J., and Chen, H. 2019. "Detection of multiclass objects in optical remote sensing images." IEEE Geoscience and Remote Sensing Letters, Vol. 16(No. 5): pp. 791–795. doi:10.1109/LGRS.2018.2882778.

Liu, Q., Xiang, X., Yang, Z., Hu, Y., and Hong, Y. 2021. "Arbitrary direction ship detection in remote-sensing images based on multitask learning and multiregion feature fusion." IEEE Transactions on Geoscience and Remote Sensing, Vol. 59(No. 2): pp. 1553–1564. doi:10.1109/TGRS.2020.3002850.

Mandal, M., Shah, M., Meena, P., Devi, S., and Vipparthi, S.K. 2020. "AVDNet: A small-sized vehicle detection network for aerial visual data." IEEE Geoscience and Remote Sensing Letters, Vol. 17(No. 3): pp. 494–498. doi:10.1109/LGRS.2019.2923564.

Mekhalfi, M.L., Bejiga, M.B., Soresina, D., Melgani, F., and Demir, B. 2019. "Capsule networks for object detection in UAV imagery." Remote Sensing, Vol. 11(No. 14): pp. 1694. doi:10.3390/rs11141694.

Pham, M.T., Courtrai, L., Friguet, C., Lefèvre, S., and Baussard, A. 2020. "YOLO-fine: One-stage detector of small objects under various backgrounds in remote sensing images." Remote Sensing, Vol. 12(No. 15): pp. 2501. doi:10.3390/rs12152501.

Qin, H., Li, Y., Lei, J., Xie, W., and Wang, Z. 2021. "A specially optimized one-stage network for object detection in remote sensing images." IEEE Geoscience and Remote Sensing Letters, Vol. 18(No. 3): pp. 401–405. doi:10.1109/LGRS.2020.2975086.

Rabbi, J., Ray, N., Schubert, M., Chowdhury, S., and Chao, D. 2020. "Small-object detection in remote sensing images with end-to-end edge-enhanced GAN and object detector network." Remote Sensing, Vol. 12(No. 9): pp. 1432. doi:10.3390/rs12091432.

Rajasegaran, J., Jayasundara, V., Jayasekara, S., Jayasekara, H., and Seneviratne, S.R.R. 2019. "DeepCaps: Going deeper with capsule networks." Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, USA, June 2019.

Rezatofighi, H., Tsoi, N., Gwak, J.Y., Sadeghian, A., Reid, I., and Savarese, S. 2019. "Generalized intersection over union: A metric and a loss for bounding box regression." Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, USA, June 2019.

Sabour, S., Frosst, N., and Hinton, G.E. 2017. "Dynamic routing between capsules." Paper presented at the 31st Conference on Neural Information Processing Systems, Long Beach, USA.

Shi, F., Zhang, T., and Zhang, T. 2020. "Orientation-aware vehicle detection in aerial images via an anchor-free object detection approach." IEEE Transactions on Geoscience and Remote Sensing. Advance online publication. doi:10.1109/TGRS.2020.3011418.

Tang, G., Liu, S., Fujino, I., Claramunt, C., Wang, Y., and Men, S. 2020. "H-YOLO: A single-shot ship detection approach based on region of interest preselected network." Remote Sensing, Vol. 12(No. 24): pp. 4192. doi:10.3390/rs12244192.

Tang, T., Zhou, S., Deng, Z., Lei, L., and Zou, H. 2017. "Arbitrary-oriented vehicle detection in aerial imagery with single convolutional neural networks." Remote Sensing, Vol. 9(No. 11): pp. 1170. doi:10.3390/rs9111170.

Tian, S., Kang, L., Xing, X., Li, Z., Zhao, L., Fan, C., and Zhang, Y. 2020. "Siamese graph embedding network for object detection in remote sensing images." IEEE Geoscience and Remote Sensing Letters. Advance online publication. doi:10.1109/LGRS.2020.2981420.

Wang, C., Bai, X., Wang, S., Zhou, J., and Ren, P. 2019. "Multiscale visual attention networks for object detection in VHR remote sensing images." IEEE Geoscience and Remote Sensing Letters, Vol. 16(No. 2): pp. 310–314. doi:10.1109/LGRS.2018.2872355.

Wang, P., Sun, X., Diao, W., and Fu, K. 2020. "FMSSD: Feature-merged single-shot detection for multiscale objects in large-scale remote sensing imagery." IEEE Transactions on Geoscience and Remote Sensing, Vol. 58(No. 5): pp. 3377–3390. doi:10.1109/TGRS.2019.2954328.

Wu, X., Hong, D., Ghamisi, P., Li, W., and Tao, R. 2018. "MsRi-CCF: Multi-scale and rotation-insensitive convolutional channel features for geospatial object detection." Remote Sensing, Vol. 10(No. 12): pp. 1990. doi:10.3390/rs10121990.

Xia, G.S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., and Zhang, L. 2018. "DOTA: A large-scale dataset for object detection in aerial images." Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, June 2018.

Yao, Q., Hu, X., and Lei, H. 2021. "Multiscale convolutional neural networks for geospatial object detection in VHR satellite images." IEEE Geoscience and Remote Sensing Letters, Vol. 18(No. 1): pp. 23–27. doi:10.1109/LGRS.2020.2967819.

Yu, Y., Guan, H., Li, D., Gu, T., Tang, E., and Li, A. 2020. "Orientation guided anchoring for geospatial object detection from remote sensing imagery." ISPRS Journal of Photogrammetry and Remote Sensing, Vol. 160: pp. 67–82. doi:10.1016/j.isprsjprs.2019.12.001.

Zhang, Z., Liu, Y., Liu, T., Lin, Z., and Wang, S. 2020. "DAGN: A real-time UAV remote sensing image vehicle detection framework." IEEE Geoscience and Remote Sensing Letters, Vol. 17(No. 11): pp. 1884–1888. doi:10.1109/LGRS.2019.2956513.

Zhang, X., Wang, G., Zhu, P., Zhang, T., Li, C., and Jiao, L. 2020. "GRS-Det: An anchor-free rotation ship detector based on Gaussian-mask in remote sensing images." IEEE Transactions on Geoscience and Remote Sensing. Advance online publication. doi:10.1109/TGRS.2020.3018106.

Zheng, Z., Zhong, Y., Ma, A., Han, X., Zhao, J., Liu, Y., and Zhang, L. 2020. "HyNet: Hyper-scale object detection network framework for multiple spatial resolution remote sensing imagery." ISPRS Journal of Photogrammetry and Remote Sensing, Vol. 166: pp. 1–14. doi:10.1016/j.isprsjprs.2020.04.019.

Zhou, K., Zhang, Z., Gao, C., and Liu, J. 2020a. "Rotated feature network for multiorientation object detection of remote-sensing images." IEEE Geoscience and Remote Sensing Letters, Vol. 18(No. 1): pp. 33–37. doi:10.1109/LGRS.2020.2965629.

Zhou, C., Zhang, J., Liu, J., Zhang, C., Shi, G., and Hu, J. 2020b. "Bayesian transfer learning for object detection in optical remote sensing images." IEEE Transactions on Geoscience and Remote Sensing, Vol. 58(No. 11): pp. 7705–7719. doi:10.1109/TGRS.2020.2983201.

Zhuang, S., Wang, P., Jiang, B., Wang, G., and Wang, C. 2019. "A single shot framework with multi-scale feature fusion for geospatial object detection." Remote Sensing, Vol. 11(No. 5): pp. 594. doi:10.3390/rs11050594.