

Detecting Trees in Street Images via Deep Learning with Attention Module

Qian Xie, Dawei Li, Zhenghao Yu, Jun Zhou, Jun Wang

Abstract—Although object detection techniques have been widely employed in various practical applications, automatic tree detection remains a difficult challenge, especially for street-view images. In this paper, we propose a unified, end-to-end trainable network for automatic street tree detection, based on a state-of-the-art deep learning based object detector. We tackle the low illumination and heavy occlusion conditions in tree detection, which have not been extensively studied until now and which prevent existing generic object detectors from being directly applied to this task. To address these issues, we first present a simple yet effective image brightness adjustment method to handle low-illuminance cases. Moreover, inspired by the previously proposed occlusion-aware R-CNN work, we propose a novel loss and a tree part-attention module to reduce false detections caused by heavy occlusion. We train and evaluate several versions of the proposed network and validate the importance of each component. We demonstrate that the resulting framework, the Part Attention Network for Tree Detection (PANTD), can efficiently detect trees in street-view images. Experimental results show that our approach achieves high accuracy and robustness under various conditions.

Index Terms—Tree detection, convolutional neural network, deep learning.

I. INTRODUCTION

Trees have become an indispensable part of densely built cities. The majority of trees are planted along roads and play an important role in the city system: they reduce dust and noise, provide shade for pedestrians, and more. Monitoring their health and growth is therefore necessary [1]. The first and most important task is to determine their quantity: the government would like to know the specific number of street trees in a given area and the species to which they belong. In the past, this problem could only be solved manually. Experts were asked to go into the street, counting and classifying trees along the road in sequence, which obviously required considerable human resources and time. Recently, leveraging street-view vehicles, municipal administration companies can capture a series of street-view images in a short time. These images are then sent to experts, who detect the trees manually via labelling tools on a computer. However, manual labeling is still tedious and time-consuming work; it is difficult to concentrate on this task for long periods, which leads to missing or inaccurate labels.

This problem can be formulated as a classic object detection problem, which has been studied for decades in the computer vision field. Thanks especially to the application of deep learning techniques, e.g. deep convolutional neural networks,

(a) Low illumination case. (b) Inter-class occlusion case. (c) Intra-class occlusion case.

Fig. 1. Examples of the three main challenges for tree detection from street-view images. These problems strongly affect the performance of generic object detectors, like Faster R-CNN [2], in this task.

both the accuracy and the efficiency of general object detection in images have been dramatically improved. However, for some specific tasks, such as pedestrian detection [3], diseased tissue detection in medical images [4] or surface crack detection in metal materials [5], [6], [7], there are still problems to overcome. For the street tree detection task, occlusion remains one of the most significant challenges, especially in crowded scenes, as shown in Figure 1(b) and (c). The captured image may also suffer from bad lighting conditions, as in Figure 1(a). Occlusion generally comes in two subtypes, inter-class occlusion and intra-class occlusion. Inter-class occlusion is the situation where trees are occluded by objects of other categories, while intra-class occlusion occurs where trees occlude each other. When we directly apply several state-of-the-art general object detection frameworks, e.g. Faster R-CNN and YOLOv3, to the presented street tree image dataset, the detection results are usually unsatisfactory. We find that the majority of these false detections are caused by occlusion. Two adjacent trees along the street may even grow into each other, making it hard to tell them apart even for human eyes. There are several studies dealing with the occlusion problem in the object detection task. However, none of them can be directly applied to street tree detection to efficiently resolve the occlusion problem between trees. On the other hand, there is no existing method for detecting street trees. Therefore, we propose a tree detection network to address this issue. Specifically, we focus on the occlusion problem in the street tree detection task to improve accuracy. Overall, this work aims to explore the use of deep learning based image analysis to implement a tree detection method for street-view images, particularly in crowded scenarios. In addition, a simple yet effective automatic image brightness adjustment method is


also proposed to handle the low-illuminance cases.

In this paper, we propose a Part Attention Network for Tree Detection (PANTD), based on the Faster R-CNN [2] detection framework, to solve the problems above. Inspired by [8], to handle partial occlusion we propose a tree part-attention module to replace the original RoI pooling layer in the classical Faster R-CNN detector; it integrates prior information about tree structure, together with visibility prediction, into the network. That is, we partition the tree region into three parts and learn their corresponding weights from the classification results on the feature map. Specifically, we treat the visibility score of each part, predicted via a learned sub-network, as the weight parameter with which to combine the extracted features of the different parts for tree detection. In this way, our network can focus on the visible parts of trees while suppressing the useless feature maps of occluded parts in the classification task. In addition, to further reduce false detections of adjacent overlapping trees, we expect the predicted proposals not only to locate compactly around their corresponding objects, but also to keep away from other ground-truth objects, as well as from other proposals whose corresponding objects are not the same. Thus, inspired by the work of [9] and [8], we design a novel loss function, referred to as Tree Loss (TLoss). With TLoss, the above two targets are achieved by minimizing the internal region distances of proposals matched to the same trees, and by penalizing the overlap between proposals whose targets are not the same. To the best of our knowledge, this is the first paper to both detect street trees and handle the associated occlusion problem.

We propose the first CNN-based detector for street tree counting and localization from street-view images. Our detection method is mainly inspired by recent work on occlusion handling in pedestrian detection [8]. However, we make two reasonable changes to their approach to adapt it to the specific characteristics of street tree images:

1) To make the network more suitable for tree detection, we divide the whole tree into three parts, instead of the four parts used for human images in [8]. To further encode the intrinsic structure of street trees more efficiently, we carefully design the ratio of these parts, which is proven to be effective in our experiments.

2) To further handle heavy occlusion situations, we embed the Repulsion Loss of [9] into the original loss function, which also assists in improving detection accuracy under heavy occlusion.

Overall, our contributions are as follows:

1) We present a fully automatic street-view tree detection framework, based on deep learning techniques, which achieves favorable performance in the detection of street trees.

2) We design an efficient part-attention module integrated into the Faster R-CNN framework, which modifies the classical RoI pooling unit to make the network focus on visible parts of trees and overcomes the poor performance in detecting trees under occlusion.

3) We propose a new loss function to enforce proposals to locate compactly around the corresponding trees, as well as to keep away from other ground-truth trees and their corresponding proposals.

4) We publicly provide a street tree image dataset for automatic tree detection research, which consists of over 2,900 manually labeled street-view images. Each image contains at least one tree, and most images contain 2-5 trees.

This paper is organized as follows. Section II reviews related work on two fronts, tree detection and deep learning based detection approaches. The details of the proposed algorithm are presented in Section III. The method is then evaluated and validated on the presented street tree dataset in Section IV, which also describes the dataset and the training details. Section V concludes the paper with perspectives on promising future work.

II. RELATED WORK

In this part, we divide our discussion of the related work into two parts, focusing on tree detection and generic object detection.

Tree Detection. Tree detection is a long-standing research problem in both remote sensing [10], [11], [12] and computer vision [13], [14], [15]. There have been numerous attempts to extract trees, in both wild and urban scenes, employing state-of-the-art detection and pattern classification algorithms from the remote sensing field for forest inventory. Airborne laser scanning and lidar are usually used to capture 3D point clouds of the trees. The trees are then detected from the point cloud by analyzing their 3D structure [16], [17]. Aval et al. [18] proposed to detect individual trees along the street from airborne hyperspectral data and digital surface models. Nevertheless, 3D sensors, like laser scanners, are much more expensive. Using high-resolution satellite images, Li et al. [19], [20] proposed a two-stage convolutional neural network for oil palm tree detection, obtaining an F1-score of 94.99% in a study area of about 55 km². In contrast, ordinary 2D images are a cheaper choice. In the computer vision field, street trees are important information for the autonomous driving task. Videos captured by the vision system are preprocessed by analyzing structure information before being used to assist decision making. Segmentation is usually the first step: most vision-based autonomous driving techniques first segment the scene, including trees, captured by the cameras in the car [21]. However, this kind of segmentation is a coarse analysis of street tree information, which is unsatisfactory for tasks like individual tree counting. There have also been some attempts to apply deep learning techniques to the detection of trees. For instance, Shah et al. [1] developed an automatic framework, based on a tree detection algorithm and a quadcopter, to recognize and localize trees for preventing deforestation. Their system uses the state-of-the-art one-stage CNN-based detector YOLO [22] to detect trees, while a monocular quadcopter flies to capture image pairs and locate the detected trees by assigning the quadcopter's GPS coordinates to them. Nevertheless, they directly employ

Page 3: JOURNAL OF LA Detecting Trees in Street Images via Deep ...3dgp.net/paper/2019/Detecting Trees in Street Images via Deep Learning with Attention...more expensive. Using high-resolution

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 3


Fig. 2. Overview of the proposed framework for automatic tree detection in street-view images. Images with low illuminance are first pre-processed to reveal their details. Integrated with the proposed TLoss and tree part-attention module (TPAM), the tree detection network can then correctly detect trees under low illuminance and heavy occlusion conditions.

the generic object detector of [22], fine-tuned on their own dataset of trees. Thus, their tree detection module has very limited capability of handling the occlusion problem in tree detection scenarios, due to the inherent characteristics of the YOLO architecture.

Object Detection via CNNs. General object detection is an important research area in computer vision [23], [24], [25]. With the application of deep convolutional neural networks [26], deep learning based methods have come to dominate the field of detection in recent years.

For generic object detection, deep learning based detectors can be grouped into two categories, two-stage and one-stage. In a two-stage approach, a sparse set of candidate proposals is first generated using region proposal methods like Selective Search [27], after which a second-stage classifier filters out the proposals belonging to the background. R-CNN [28] was the first successful attempt to replace the second-stage classifier with a convolutional network, yielding large gains in accuracy. Numerous improvements on this framework have since been proposed [29], [30], [31]. Among them, Ren et al. [2] integrated a Region Proposal Network (RPN) with the second-stage classifier into a single convolutional network, forming the end-to-end framework Faster R-CNN. Despite achieving high detection accuracy, two-stage methods are unsatisfying in terms of speed. To achieve a balance between accuracy and speed, one-stage methods [32], [22], [33] with deep networks have been proposed for real-time performance. Notably, the state-of-the-art one-stage detection system YOLOv3 [34] achieves 28.2 mAP while processing a single 320 × 320 pixel frame in 22 ms. Recently, deep learning based detectors have also been employed in engineering applications such as defect detection [35], [36], [37]. However, very few conventional vision-based algorithms or deep learning based approaches specialize in tree detection tasks.

III. METHOD

We present a novel framework for the detection of street trees in low-illuminance situations and crowded scenarios. Figure 2 shows an overview of our pipeline. The method comprises two primary stages: (1) brightness adjustment of input images and (2) street tree detection.

The first, preprocessing stage is responsible for the adaptive restoration of images with low illuminance, which increases detection accuracy, as will be shown in the experiment section. Once the images are brightened, the second stage performs tree detection with the proposed network. The detection network, i.e. PANTD, is based on the state-of-the-art two-stage Faster R-CNN detector, composed of a ResNet backbone, a region proposal network and the Fast R-CNN [38] module, as shown in Figure 3. To address the occlusion problem in street tree detection, we design a part-attention module and add it to the Faster R-CNN network. We also propose a new loss function, TLoss, to further alleviate occlusion by punishing proposals that shift toward other ground-truth objects.

A. Image preprocessing

In this section, we present a simple yet effective approach to increase the brightness of input images, to solve the problem of missed detections caused by low illuminance, as shown in Figure 4. Under bad lighting conditions, some captured images have such low illuminance that trees are very difficult for the detector to find. Figure 4 shows an example of this situation: even humans can barely figure out where the trees are in such an image. This low-illuminance issue occurs from time to time and greatly hinders the performance of the detection network. Thus, as a pre-processing step in the proposed framework, an adaptive brightness-increasing algorithm is adopted to automatically judge whether the input image is too dark and increase its brightness accordingly, before it is sent to the subsequent detection module.

The input image S is first converted to a grayscale image, on which the average gray value g is calculated. If the average gray value is less than 60 (a threshold selected through our experiments), we apply the following equation to obtain the pre-processed image D:

D = φ·S + δ    (1)

where φ is adaptively set to φ = 100/g, and δ is 10 in our experiments. The brightening operation is performed on each channel. Note that this preprocessing operation can be replaced by other image preprocessing methods, such as histogram stretching.
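As an illustration, a minimal sketch of this step is given below, assuming an 8-bit BGR image as read by OpenCV; the function name, the pass-through behavior for bright images and the clipping are our own choices:

```python
import cv2
import numpy as np

def adjust_brightness(image, threshold=60.0, delta=10.0):
    """Brighten an image whose average gray value falls below a threshold.

    Sketch of Eq. (1), D = phi * S + delta with phi = 100 / g, applied
    per channel; returns the input unchanged if it is bright enough.
    """
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    g = float(gray.mean())
    if g >= threshold:
        return image
    phi = 100.0 / g
    out = image.astype(np.float32) * phi + delta  # linear brightening
    return np.clip(out, 0, 255).astype(np.uint8)
```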

Our simple yet effective brightening operation in the preprocessing stage results in a positive improvement in accuracy under



Fig. 3. Network architecture of the proposed model for tree detection (PANTD). The pre-processed image is first sent into the ResNet-101 [39] backbone for intermediate feature map extraction. The Region Proposal Network with TLoss is then used to generate high-quality proposals from these feature maps. We then feed the proposals into the tree part-attention module and the Fast R-CNN module to produce the final output.

(a) Input with ground truth. (b) Result without pre-processing. (c) Result with pre-processing.

Fig. 4. Predicted results on low-illuminance cases. (a) Input images with ground truths (red). (b) Detection results (green bounding boxes) of the proposed framework without the brightness adjustment operation, in which some ground-truth trees are missed. (c) Detection results of the full framework with pre-processing.

such cases. As can be seen from Figure 4, the original input images are too dark for trees to be recognized. Surprisingly, even without preprocessing, the proposed network can still detect some trees correctly under such low illuminance, though it misses others. With the brightening operation, our framework detects all the trees.

B. Part Attention Network for Tree Detection (PANTD)

Backbone Architecture. In deep learning based object detection architectures, the first component is normally a pre-trained CNN whose intermediate-layer output serves as the feature map. As the feature maps extracted by the pre-trained network lay the foundation for the subsequent tasks, i.e. region proposal and classification, the choice of backbone network is crucial: the types of layers and the number of parameters directly affect the memory footprint, speed and performance of the detector. As verified in several works [33], [40], ResNet [39] achieves a significant improvement in detection accuracy compared to the original VGG-16 [41] and ZF-net [42] in Faster R-CNN. With residual connections and batch normalization, ResNet makes deep models easier to train. Thus, we adopt ResNet-101, pre-trained on the ImageNet dataset [43], as our backbone network. ResNet-101 consists of 5 main residual blocks (i.e. conv1, conv2_x, conv3_x, conv4_x, conv5_x), each of which contains a set of repeated residual layers. In all, the ResNet-101 architecture comprises 101 layers of repeated convolution and pooling layers along with fully connected layers. In our experiments, we use the last layer of the conv4_x block of ResNet-101 for predicting region proposals, as in [39].

Region Proposal Network for Tree Detection. To generate good-quality tree proposals, we adapt a tailored region proposal network over the convolutional feature map output by the last convolutional layer of the backbone model (ResNet-101). Compared with previous region proposal methods, like Selective Search [27], the RPN architecture adopts automatic feature learning instead of traditional hand-crafted feature extraction, significantly improving accuracy and robustness. A region proposal network takes feature maps as input and outputs a collection of bounding boxes (300 proposals), each with a relatively high confidence score. Similar to [44], we modify the RPN for the tree detection task. Specifically, the aspect ratio of anchors is fixed at 0.4 (width to height), based on the observations in this paper. Unlike in generic object detection, this modification avoids the accuracy drop caused by anchors with inappropriate aspect ratios. For each pixel in the intermediate feature map, the number of anchors is still nine, covering 9 different scales that start at a height of 40 pixels and grow by a scaling factor of 1.2; this handles multi-scale cases. The RPN itself is implemented as a 3×3 convolutional layer followed by two sibling 1×1 convolutional layers: a box-regression layer and a box-classification layer.
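To make these settings concrete, the following sketch enumerates the implied anchor shapes; the function name and the exact geometric progression of scales are our reading of the text:

```python
import numpy as np

def tree_anchor_shapes(base_height=40.0, scale_factor=1.2,
                       num_scales=9, aspect_ratio=0.4):
    """Enumerate (width, height) pairs for the nine tree anchors.

    Width-to-height ratio is fixed at 0.4; heights grow from 40 px
    by a factor of 1.2 across nine scales, per the description above.
    """
    heights = base_height * scale_factor ** np.arange(num_scales)
    widths = aspect_ratio * heights
    return np.stack([widths, heights], axis=1)

# Nine anchors are placed at every position of the intermediate feature map.
print(tree_anchor_shapes())
```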

TLoss. Occlusion remains a significant challenge in street tree detection and usually leads to false detections of adjacent overlapping trees. These errors are typically caused by predicted boxes shifting toward neighboring ground-truth trees, or by bounding boxes covering the union of overlapping ground-truth trees. Therefore, we expect proposals to locate compactly around their associated ground-truth trees and to stay



Fig. 5. Illustration of the proposed Tree Loss. (a) An image with a ground-truth label (red) and several proposals (green). (b) Repulsion loss: a proposal should be repelled from the ground-truth object that is not its correspondence. (c) Aggregation loss: proposals corresponding to the same ground truth should stay as close to each other as possible. With these two constraints, false detections due to intra-class occlusion are strongly alleviated.

away from their neighboring ground-truth trees, which are not their targets, as shown in Figure 5. To this end, we introduce the TLoss L_Tree in the RPN module to realize the two constraints above; it combines the classification loss with the two new terms via balance weights:

L_Tree = L_cls + α·L_Rep + β·L_Agg    (2)

TLoss is thus made up of three components, L_cls, L_Rep and L_Agg, where the coefficients α and β act as hyper-parameters balancing the two new losses; both are set to 1. We calculate the classification loss L_cls using the log loss over two classes, i.e. foreground and background:

L_cls = (1/N_cls) Σ_i −( p*_i log p_i + (1 − p*_i) log(1 − p_i) )    (3)

where p*_i and p_i are the ground-truth class label and the prediction of the i-th anchor, respectively, and N_cls denotes the number of anchors.

Symbolically, we let b = (x_b, y_b, w_b, h_b) ∈ B be a proposal bounding box, and g = (x_g, y_g, w_g, h_g) ∈ G be a ground-truth object. (x, y) is the center point of the box in the image pixel coordinate system, and w and h are the width and height of the box. B and G are, respectively, the proposal and ground-truth object sets in one mini-batch.

Repulsion Loss is designed to repel a proposal from neighboring ground-truth objects that are not its correspondence. Given a proposal bounding box b, the corresponding ground-truth object is defined as g_b = argmax_{g∈G} IoU(g, b). Its neighboring ground-truth object g_b^n is then defined as:

g_b^n = argmax_{g ∈ G \ {g_b}} IoU(g, b)    (4)

The repulsion loss is then formulated by penalizing the overlap between a proposal bounding box b and its neighboring ground-truth object g_b^n. We measure the overlap by Intersection over Ground truth (IoG):

IoG(b, g) = R(b ∩ g) / R(g)    (5)

We then calculate the repulsion loss as:

L_Rep = ( Σ_{b∈B} smooth_ln(IoG(b, g_b^n)) ) / |P|    (6)

where P is the set of all positive proposals, as in [9].
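A sketch of the two ingredients of Eqs. (5)-(6) follows; the corner-coordinate (x1, y1, x2, y2) box layout is our own convention, and smooth_ln follows the piecewise definition given in [9]:

```python
import math
import torch

def iog(boxes, gts):
    """Intersection over Ground truth, Eq. (5), for paired (N, 4) boxes."""
    x1 = torch.max(boxes[:, 0], gts[:, 0])
    y1 = torch.max(boxes[:, 1], gts[:, 1])
    x2 = torch.min(boxes[:, 2], gts[:, 2])
    y2 = torch.min(boxes[:, 3], gts[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    gt_area = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / gt_area.clamp(min=1e-6)

def smooth_ln(x, sigma=0.5):
    """Smoothed ln penalty from the Repulsion Loss paper [9]."""
    return torch.where(x <= sigma,
                       -torch.log((1 - x).clamp(min=1e-6)),
                       (x - sigma) / (1 - sigma) - math.log(1 - sigma))

def repulsion_loss(pos_proposals, neighbor_gts):
    """Eq. (6): penalize overlap between each positive proposal and its
    second-best-matching (neighboring) ground truth."""
    return smooth_ln(iog(pos_proposals, neighbor_gts)).sum() / max(
        len(pos_proposals), 1)
```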

Aggregation Loss enforces predicted bounding boxes that share the same target ground-truth object to be compact with each other, reducing false detections of adjacent overlapping trees. Specifically, let K be the number of ground-truth objects associated with more than one anchor in one mini-batch, and g′_1, ..., g′_K be the collection of these ground-truth bounding boxes. I_1, ..., I_K denote the index sets of the anchors corresponding to these ground-truth boxes; that is, ground truth g′_j is associated with the anchors indexed by I_j. The aggregation loss is then defined as:

L_Agg = ( Σ_{i=1..K} smooth_L1( g′_i − (1/|I_i|) Σ_{j∈I_i} b_j ) ) / K    (7)

The smooth L1 loss is formulated as:

smooth_L1(x) = 0.5x²        if |x| < 1
               |x| − 0.5    otherwise    (8)
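A corresponding sketch of Eqs. (7)-(8) is shown below; the per-ground-truth grouping loop and the tensor layout are our own simplifications:

```python
import torch
import torch.nn.functional as F

def aggregation_loss(pos_proposals, matched_gt_idx, gt_boxes):
    """Eq. (7): pull proposals that share a ground truth toward it.

    pos_proposals: (N, 4) positive proposal boxes; matched_gt_idx: (N,)
    index of each proposal's ground truth in gt_boxes, which is (K, 4).
    """
    total, count = pos_proposals.new_zeros(()), 0
    for k in torch.unique(matched_gt_idx):
        members = pos_proposals[matched_gt_idx == k]
        if len(members) < 2:  # only ground truths matched by >1 anchor count
            continue
        # smooth-L1 between the ground truth and the mean of its proposals
        total = total + F.smooth_l1_loss(members.mean(dim=0), gt_boxes[k],
                                         reduction="sum")
        count += 1
    return total / max(count, 1)
```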

Tree Part-attention Module (TPAM). On the curbside, street trees are often occluded by each other, as well as by other things such as cars parked along the road, which significantly impedes the accuracy of tree detectors. As observed, poor detection performance is usually caused by the invisibility of occluded parts of the tree, like the trunk or the crown. Thus, we propose to embed a part-based detection model, which has been shown to be effective in handling occluded scenarios [45], [46], into the classical Faster R-CNN detection framework. Specifically, we present a tree part-attention module that integrates the prior structure information of a tree, together with visibility prediction, into the detector. As shown in Figure 6, the integration is achieved by a micro neural network module, which estimates the visibility of each part. After the tree proposals are generated by the Region Proposal Network (RPN), we divide the whole tree region into three parts according to the inherent structure of the tree, as illustrated in Figure 6(a). Instead of directly feeding the whole region of the feature map into the RoI pooling layer, we first separately pool the feature map of each part into a feature map with a fixed spatial size of H × W (e.g., 7 × 7). These part feature maps are then assigned different attention weights according to their visibility: feature maps of occluded parts are assigned smaller weights during combination. In this way, the visible parts of the tree contribute more to subsequent tasks like classification, which leads to higher accuracy. The associated weights are decided by occlusion scores, indicating the degree of occlusion. According to the tree's inherent structure, the division ratio of each part is given in Figure 6(a): the whole tree area is divided into three parts, two for the crown and one for the trunk. All three parts have the same width W/2, while the heights are 2H/3 for the two upper crown parts and H/3 for the trunk part. W and H are, respectively, the width and height of a proposal bounding box.
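For clarity, a sketch of this part division in image coordinates is given below; in the network the split is actually applied on RoI feature maps, and the horizontal centering of the trunk part is our assumption from Figure 6(a):

```python
def split_tree_proposal(x1, y1, x2, y2):
    """Divide a tree proposal into (left crown, right crown, trunk) parts.

    The two crown parts are each W/2 wide and 2H/3 tall at the top;
    the trunk part is W/2 wide and H/3 tall at the bottom (assumed
    horizontally centered).
    """
    w, h = x2 - x1, y2 - y1
    crown_left = (x1, y1, x1 + w / 2, y1 + 2 * h / 3)
    crown_right = (x1 + w / 2, y1, x2, y1 + 2 * h / 3)
    trunk = (x1 + w / 4, y1 + 2 * h / 3, x1 + 3 * w / 4, y2)
    return crown_left, crown_right, trunk
```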

In all, we define a part-attention module which applies a weighting function over the visibility of the proposal, to adaptively weight the outputs of the feature map of each part. Leveraging the recently popular attention mechanisms [47],


[Fig. 6 layout: each of the three parts is RoI-pooled and passed through an attention process unit (Conv 3×3-s1,64 → Conv 3×3-s1,32 → Conv 3×3-s1,2 → softmax → scale); the weighted part feature maps are combined by element-wise summation. Part ratios: heights 2/3H (crown) and 1/3H (trunk); widths 1/2W.]

Fig. 6. Architecture of the tree part-attention module (TPAM). (a) Division ratio of the three parts in a tree proposal. (b) Part-attention module: each part p_i, instead of the whole proposal region, is assigned an attention score a_i, and the parts are then combined. (c) The attention unit takes a part feature map as input and outputs the weighted feature map.

the TPAM makes the Fast R-CNN module pay more attention to the visible parts of the feature map. More specifically, let c_{i,j} denote the j-th part of the i-th proposal; the attention score a_{i,j} of c_{i,j} is estimated by the attention weight unit. As shown in Figure 6, the attention weight unit is assembled from three convolutional layers, followed by a softmax layer with the log loss. The unit is thus fed the feature map of each part and outputs the corresponding predicted attention score. During training, the ground-truth attention score a*_{i,j} is computed in advance as:

a*_{i,j} = 1 if Θ > 0.5, and 0 otherwise    (9)

The attention score is formulated in terms of the intersection between two regions. Accordingly, Θ is defined as:

Θ = Ω(R^c_{i,j} ∩ V*_i) / Ω(R^c_{i,j})    (10)

Ω(·) is the area-computing function, R^c_{i,j} is the region of the j-th part of the i-th proposal, and V*_i is the visible region of the ground-truth object g_i. By this definition, the ground-truth attention weight is 1 if the intersection between c_{i,j} and the visible region of the corresponding ground-truth object g_i covers more than half of the part. Given an image as input, the attention weights are computed by the attention process unit in Figure 6(c), which assigns a weight score to a given part according to its pixel content. Intuitively, a part without occlusion receives a high attention weight, while an occluded part receives a small one. The weights of the attention process unit are learned under the supervision of the occlusion score.
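A sketch of one attention process unit follows; the channel widths (64, 32, 2) are those recoverable from Figure 6(c), while the input channel count, the pooling of logits, and treating the second softmax output as the visibility score are our assumptions:

```python
import torch.nn as nn

class AttentionProcessUnit(nn.Module):
    """Three 3x3 convolutions, softmax, then scaling of the part feature."""

    def __init__(self, in_channels=1024):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(64, 32, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, stride=1, padding=1),
        )

    def forward(self, part_feat):
        # part_feat: (N, C, H, W) RoI-pooled feature map of one tree part
        logits = self.convs(part_feat).mean(dim=(2, 3))   # -> (N, 2)
        score = logits.softmax(dim=1)[:, 1:2]             # visibility weight
        return part_feat * score[:, :, None, None]        # scaled feature map
```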

IV. RESULTS AND DISCUSSIONS

This section presents experiments carried out to evaluate the performance of the proposed detection network. Various visualized street tree detection results are shown first. We then illustrate the role of each proposed module through an ablation study on detection accuracy. In addition, we compare the proposed method with previous state-of-the-art detection networks. We also describe the dataset used in this paper and the training settings in detail.

TABLE I
DETAILED INFORMATION ABOUT THE ORIGINAL STREET-VIEW TREE IMAGE DATASET.

Tree species    No. of images
Platanus            487
Zelkova             452
Koelreuteria        408
Privet              348
Metasequoia         200
Lamucuo             455
Ginkgo              569
Total             2,919

A. Dataset

We apply the proposed tree detection network to the street tree detection task. To train and evaluate the proposed network, we present a new street tree image dataset. Since no street tree image dataset exists, we captured images from 22 roads in the city of Nanjing, China, using a street-view collection vehicle equipped with a camera with five fisheye lenses. All five lenses take photos synchronously at a fixed time interval while the vehicle moves uniformly along the road. The five lenses are calibrated in advance; thus, for each shot, the five pictures captured by the individual lenses can be registered together into one unified panorama, as shown in Figure 7. To discard regions belonging to the sky and the road, this panorama is then projected onto a hexahedral box. Five images are generated accordingly, and only the two images on the left and right sides are taken into our dataset. These images were then annotated by a professional municipal management company that cooperates with us on this project. The LabelImg [48] tool was used to manually label the data. Our tree dataset contains a total of 2,919 labeled images, with 8,297



Fig. 7. (a) Street-view vehicle equipped with fisheye lenses. (b) Example of an original panoramic image captured by the street-view vehicle. The original images cannot be used directly as captured.

ground-truth bounding boxes. Detailed information about the proposed dataset is given in Table I.

To train our tree detection model, we divide the dataset into three groups, as shown in Table II. All the results in this section are computed on the Test set. Note that we also assign each ground-truth bounding box a label according to its species in the dataset. However, we treat all tree categories as one class in this paper, since we mainly focus on the tree detection task; classifying the detected trees is left for subsequent work, as mentioned in the future work section. Moreover, we will release the dataset on our website to accelerate research on automatic tree detection.

TABLE II
DETAILED INFORMATION ABOUT THE DATASET DIVISION FOR TRAINING AND TESTING.

Group               Training    Validation    Test
Number of images       2,200           200     500
Number of trees        6,251           535   1,464

B. Training details

Given a street-view image with ground-truth annotations, we define an anchor as positive when it has an Intersection-over-Union (IoU) score greater than 0.5 with a single ground truth, while anchors with an IoU score below 0.3 are defined as negative. When training the RPN, we randomly choose 256 anchors (128 positive and 128 negative) to form a mini-batch. In each iteration, all the anchors in the mini-batch are used to calculate the classification loss using binary cross-entropy, with only the foreground anchors contributing to the regression loss. Stochastic gradient descent is used: a forward pass produces predictions that are compared with the ground truth to compute the loss, which is then backpropagated to update the model parameters. The approximate joint training strategy is adopted to save training time while maintaining accuracy. We train for 60,000 iterations with an initial learning rate of 0.001, momentum of 0.9 and weight decay of 0.0001. After 40,000 iterations, we reduce the learning rate to 0.0001. The training and experiments are run on a computer with a Core i7-8700K CPU, 16 GB of DDR4 memory, and an 8 GB NVIDIA GeForce GTX 1080 GPU.
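For reference, this optimization schedule maps to a few lines of a standard training script, sketched below; `model` and `train_one_iteration` are placeholders for the PANTD network and its forward/loss computation, and the scheduler choice is our own (any step-wise decay would do):

```python
import torch

# SGD with momentum 0.9, weight decay 1e-4; learning rate 1e-3 dropped
# by 10x after 40,000 of 60,000 iterations, per the training details above.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40_000], gamma=0.1)

for iteration in range(60_000):
    loss = train_one_iteration(model)  # forward pass + TLoss (placeholder)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```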

C. Evaluation Metrics

To quantitatively evaluate the performance of tree detectors, we plot the miss rate (MR) against false positives per image (FPPI) on a log scale, varying the threshold on detection confidence, following [49]. For tasks with only one class of target, e.g. pedestrian detection and defect region detection, the miss rate vs. false positives per image curve is preferred over the precision-recall curves used in generic object detection. We then use the log-average miss rate MR−2 as a single reference value summarizing detector performance, similar to average precision. This value is computed in MATLAB by averaging the miss rate at nine FPPI rates uniformly spaced in log-space over the range 10⁻² to 10⁰:

MR−2 = exp( (1/n) Σ_{i=1..n} ln a_i )    (11)

where n is set to 9, and a_i is the miss rate at the i-th of the nine evenly spaced FPPI points.
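The metric can be reproduced outside MATLAB as in the sketch below; linear interpolation of the curve in log-FPPI space is our assumption about how the nine samples are read off:

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, n=9):
    """Eq. (11): geometric mean of miss rates at n FPPI points
    log-uniformly spaced in [1e-2, 1e0].

    fppi and miss_rate are parallel, FPPI-sorted arrays tracing the
    detector's MR-FPPI curve.
    """
    samples = np.logspace(-2.0, 0.0, n)
    a = np.interp(np.log10(samples), np.log10(fppi), miss_rate)
    a = np.clip(a, 1e-10, None)  # guard against log(0)
    return np.exp(np.log(a).mean())
```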

D. Qualitative Results

Inter-class occlusion cases. To verify the effectiveness of PANTD under occlusion, we test the proposed method on various inter-class occlusion cases. As shown in Figure 8, the captured images contain heavy occlusion, where the crowns merge into a single whole; it is almost impossible for a human being to distinguish these trees from each other. As can be seen, Faster R-CNN misses some trees, since some boxes shift to intermediate positions. In contrast, the TLoss in our network restrains the predicted bounding boxes from getting too close to other bounding boxes. Moreover, predicted bounding boxes whose corresponding ground truth is the same are forced to locate compactly with each other. The proposed


TABLE III
EFFECTS OF THE PROPOSED METHODS (I.E. PREPROCESS, TLOSS AND PART-ATTENTION MODULE) ON THE TEST DATASET. PERFORMANCE IS COMPUTED IN TERMS OF LOG-AVERAGE MISS RATE; LOWER IS BETTER.

Method                    +Preprocess   +TLoss   +Part-attention module   MR−2 (%)
Baseline (Faster R-CNN)                                                      36.23
PANTD-A                        ✓                                             35.52
PANTD-B                                    ✓                                 33.64
PANTD-C                                                 ✓                    30.23
PANTD-D                        ✓           ✓                                 26.89
PANTD-E                                    ✓            ✓                    24.30
PANTD-F                        ✓           ✓            ✓                    20.62

(a) Input with ground truth. (b) Faster R-CNN. (c) Ours.

Fig. 8. Inter-class occlusion detection results. Some occluded trees are missed by Faster R-CNN, while our model detects them correctly.

(a) Input with ground truth. (b) Faster R-CNN. (c) Ours.

Fig. 9. Intra-class occlusion detection results. As shown, even when only part of a tree is visible, our method can still detect it correctly.

TLoss avoids mutual interference between trees in crowded scenes. Thus, the proposed approach yields much better results in localizing occluded trees, as shown in the visualized detection examples.

Intra-class occlusion cases. We also compare our method with Faster R-CNN on intra-class occlusion cases to demonstrate the robustness of the proposed part-attention module under occlusion. Figure 9 verifies that the proposed network succeeds in detecting trees with just one part visible, while

[Fig. 10 plot: miss rate vs. false positives per image (MR-FPPI) curves for PANTD-A through PANTD-F.]

Fig. 10. Quantitative evaluation of the proposed method under various configuration settings. PANTD-F is the algorithm with the full configuration. Lower is better.

the Faster R-CNN detector misses them. The reason is that the RoI pooling layer in Faster R-CNN prefers to predict boxes with intact feature maps, whereas partial occlusion usually leads to incomplete features, which our part-attention module can still capture. When some parts of a tree are missing due to occlusion, the part-attention mechanism assigns higher weights to the visible parts during the classification and regression of candidate bounding boxes. Thus, our tree part-attention network can still tell whether a candidate object is a tree merely from the visible parts.

E. Ablation Study

To demonstrate the effectiveness of the pre-processing stage, TLoss and our part-attention module, we analyze the results of our network trained under different settings of these modules. We use Faster R-CNN as the baseline method. We set up 6 different network configurations, denoted PANTD-A through PANTD-F: A-E stand for networks with one or two modules embedded, as shown in Table III, while the proposed method, which combines all three modules, is denoted PANTD-F. Through this configuration, each proposed component can be evaluated in isolation. As can be seen from Figure 10 and Table III, with the brightness adjustment process alone (PANTD-A), the miss rate is reduced by 0.71 (from 36.23 to 35.52 MR−2).



Fig. 11. Examples of attention scores for tree parts predicted by the proposed tree part-attention module. As shown, our method correctly predicts how much attention should be paid to each part, according to its visibility.

Compared with the detector with pre-processing only, Pre-process+TLoss (PANTD-D) further reduces the miss rate, achieving 26.89 MR−2, which confirms the effectiveness of TLoss. When we replace the RoI pooling layer with the tree part-attention module (PANTD-C), we observe that the module significantly improves performance, with an impressive 6.00-point gain. The best overall accuracy, 20.62 MR−2, is achieved by the proposed method with the complete configuration, demonstrating the effectiveness of our method under bad lighting and heavy occlusion conditions.

Meanwhile, to further illustrate the mechanism of the proposed tree part-attention module, some visualized predictions with the attention scores of the corresponding parts are given in Figure 11. Not surprisingly, we observe that the proposed tree part-attention module captures the occlusion in a real scene in accordance with the human visual system. As shown in Figure 11(a) and (b), the attention scores are close to 1 when the corresponding parts are clearly visible. Nevertheless, when some parts are occluded by a car, a bus or other trees, their attention scores decrease accordingly, as for the occluded tree trunk and crown shown in Figure 11(c)-(d). Consequently, instead of passing the whole feature map, including occluded parts, through a conventional RoI pooling layer, we feed the combination of weighted parts to the subsequent layers. This naturally helps the network focus on useful features while suppressing distractions from other objects in the occluded regions, maintaining high detection performance under heavy occlusion.

F. Comparisons with Previous Methods

We conduct comparative experiments to evaluate our network on the proposed dataset against both state-of-the-art two-stage generic object detectors (Faster R-CNN [2], Fast R-CNN [38]) and one-stage detectors (SSD [32], YOLOv3 [34]).

TABLE IV
COMPARISON RESULTS ON THE TEST DATASET. PERFORMANCE IS COMPUTED IN TERMS OF LOG-AVERAGE MISS RATE; LOWER IS BETTER.

Method               MR−2 (%)
Fast R-CNN [38]         44.63
Faster R-CNN [2]        36.23
SSD [32]                40.99
YOLOv3 [34]             39.53
Ours                    20.62

Faster R-CNN is the most representative two-stage deep learning object detector, composed of three main parts: a backbone feature extraction network, a Region Proposal Network (RPN) and a classification-and-regression network. It introduced, for the first time, a highly efficient region proposal algorithm, the region proposal network, which uses feature maps from intermediate layers of a standard classification network to provide bounding box candidates covering potential objects. YOLOv3 is the latest improved version of the state-of-the-art real-time object detection system You Only Look Once (YOLO). YOLO uses a single deep convolutional neural network to divide the input image into a grid of cells, each of which directly predicts bounding boxes and their object classifications. YOLOv3 adds a few tricks to improve training and increase performance over its predecessor, including multi-scale predictions and a better backbone classifier, achieving an excellent trade-off between efficiency and accuracy. Note that, for a fair comparison, the compared methods are also run on the preprocessed dataset. We retrain these networks on the training split of the proposed tree dataset, fine-tuning them from their original weights until convergence. The results are shown in Figure 12 and Table IV. The proposed algorithm achieves the lowest log-average miss rate of 20.62. Compared with the results of Faster R-CNN, Fast R-CNN, SSD and YOLOv3, the overall performance of the proposed network is improved and its robustness to bad lighting conditions is significantly reinforced, verifying the effectiveness of our proposed components, especially in bad lighting cases.

In addition to the quantitative results, we also use a set of visual comparisons to demonstrate the superiority of the proposed method. Figure 13 shows visual results from Faster R-CNN, YOLOv3 and our method. In the first row, all the methods succeed in detecting trees; however, the bounding boxes predicted by our method fit the ground-truth boxes best, because we integrate prior knowledge of tree structure into the anchor boxes of the region proposal network. In the last two rows, it can also be observed that our method detects occluded trees more effectively than the other two methods.

As for processing time, YOLOv3 is undoubtedly the fastest detector, since it is one-stage. However, processing time is not the focus of this work. The starting


[Fig. 12 plot: miss rate vs. false positives per image (MR-FPPI) curves for Fast R-CNN, SSD, YOLOv3, Faster R-CNN and ours.]

Fig. 12. Quantitative evaluation of the proposed method against various other methods. Lower is better.

point of this paper is counting street trees, so the most important requirement is to detect them correctly. After the street images are captured, they are stored and processed off-line; the municipal administration cares more about detection accuracy than processing time. Thus, we mainly focus on improving detection accuracy in this paper. In all, the proposed method achieves the best performance in terms of accuracy, at the cost of slightly more processing time than YOLOv3 and Faster R-CNN, which is still acceptable in practical applications.

V. CONCLUSIONS AND FUTURE WORK

This paper presents a novel framework for tree detection, leveraging state-of-the-art deep learning based detection methods, along with two innovations, in the training loss definition and the network module design, to handle the occlusion problem. In particular, we introduce a new training loss to suppress false detections caused by crowding, by simultaneously enforcing the proposals associated with the same ground-truth tree to locate compactly with each other and repelling them from ground-truth trees that are not their targets. Additionally, to cope effectively with missing tree parts, we apply a part-attention module that integrates the prior structure information of trees, together with occlusion prediction, into the Faster R-CNN detector. The parts predicted as occluded by the proposed unit are assigned smaller weights to reduce their negative influence on the subsequent classification task. We apply our tree part-attention network to the street tree detection task. The proposed method achieves high-fidelity detection on the presented street tree dataset captured by street-view collection vehicles.

Detailed experimental comparisons also demonstrate that our proposed framework improves detection accuracy by a large margin over the baseline, particularly in crowded scenarios where occlusion occurs. However, owing to the differences within the same tree species and the similarities across diverse tree species, street tree classification remains a challenging problem. A possible future direction is to design an efficient classification network that tells the detected trees apart according to their species. Furthermore, the proposed preprocessing method is closely tied to the set of images used in this paper and may not generalize to other image datasets. A more general method for the low-illumination problem is therefore another direction for future work.

REFERENCES

[1] U. Shah, R. Khawad, and K. M. Krishna, "Detecting, localizing, and recognizing trees with a monocular mav: Towards preventing deforestation," in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 1982–1987.

[2] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.

[3] W. Ouyang and X. Wang, "Joint deep learning for pedestrian detection," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2056–2063.

[4] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sanchez, "A survey on deep learning in medical image analysis," Medical Image Analysis, vol. 42, pp. 60–88, 2017.

[5] X. Gibert, V. M. Patel, and R. Chellappa, "Deep multitask learning for railway track inspection," IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 1, pp. 153–164, 2017.

[6] H. Zhang, X. Jin, Q. J. Wu, Y. Wang, Z. He, and Y. Yang, "Automatic visual detection system of railway surface defects with curvature filter and improved gaussian mixture model," IEEE Transactions on Instrumentation and Measurement, vol. 67, no. 7, pp. 1593–1608, 2018.

[7] H. Yu, Q. Li, Y. Tan, J. Gan, J. Wang, Y.-a. Geng, and L. Jia, "A coarse-to-fine model for rail surface defect detection," IEEE Transactions on Instrumentation and Measurement, vol. 68, no. 3, pp. 656–666, 2018.

[8] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, "Occlusion-aware r-cnn: Detecting pedestrians in a crowd," arXiv preprint arXiv:1807.08407, 2018.

[9] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen, "Repulsion loss: Detecting pedestrians in a crowd," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7774–7783.

[10] H. Kaartinen, J. Hyyppa, X. Yu, M. Vastaranta, H. Hyyppa, A. Kukko, M. Holopainen, C. Heipke, M. Hirschmugl, F. Morsdorf et al., "An international comparison of individual tree detection and extraction using airborne laser scanning," Remote Sensing, vol. 4, no. 4, pp. 950–974, 2012.

[11] B. Guo, X. Huang, F. Zhang, and G. Sohn, "Classification of airborne laser scanning data using jointboost," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 100, pp. 71–83, 2015.

[12] D. Tianyang, Z. Jian, G. Sibin, S. Ying, and F. Jing, "Single-tree detection in high-resolution remote-sensing images based on a cascade neural network," ISPRS International Journal of Geo-Information, vol. 7, no. 9, p. 367, 2018.

[13] J. Secord and A. Zakhor, "Tree detection in urban regions using aerial lidar and image data," IEEE Geoscience and Remote Sensing Letters, vol. 4, no. 2, pp. 196–200, 2007.

[14] S. Malek, Y. Bazi, N. Alajlan, H. AlHichri, and F. Melgani, "Efficient framework for palm tree detection in uav images," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 12, pp. 4692–4703, 2014.

[15] A. H. Ozcan, Y. Sayar, D. Hisar, and C. Unsalan, "Multiscale tree analysis from satellite images," in Recent Advances in Space Technologies (RAST), 2015 7th International Conference on. IEEE, 2015, pp. 265–269.

[16] O. Nevalainen, E. Honkavaara, S. Tuominen, N. Viljanen, T. Hakala, X. Yu, J. Hyyppa, H. Saari, I. Polonen, N. N. Imai et al., "Individual tree detection and classification with uav-based photogrammetric point clouds and hyperspectral imaging," Remote Sensing, vol. 9, no. 3, p. 185, 2017.

[17] V. F. Strîmbu and B. M. Strîmbu, "A graph-based segmentation algorithm for tree crown extraction using airborne lidar data," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 104, pp. 30–43, 2015.


Fig. 13. Examples of comparison results on (a) input images with ground truth, for (b) Faster R-CNN, (c) YOLOv3, and (d) our method. As can be seen, directly applying Faster R-CNN or YOLOv3 to the task of street tree detection is unsatisfactory due to the bad lighting conditions and heavy occlusions.

[18] J. Aval, J. Demuynck, E. Zenou, S. Fabre, D. Sheeren, M. Fauvel, and X. Briottet, "Individual street tree detection from airborne data and contextual information," in GEOBIA 2018 - From pixels to ecosystems and global sustainability?, 2018.

[19] W. Li, H. Fu, L. Yu, and A. Cracknell, "Deep learning based oil palm tree detection and counting for high-resolution remote sensing images," Remote Sensing, vol. 9, no. 1, p. 22, 2016.

[20] W. Li, R. Dong, H. Fu et al., "Large-scale oil palm tree detection from high-resolution satellite images using two-stage convolutional neural networks," Remote Sensing, vol. 11, no. 1, p. 11, 2019.

[21] Z. Zhang, S. Fidler, and R. Urtasun, "Instance-level segmentation for autonomous driving with deep densely connected MRFs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 669–677.

[22] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.

[23] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.

[24] P. Dollar, Z. Tu, P. Perona, and S. Belongie, "Integral channel features," 2009.

[25] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester, "Cascade object detection with deformable part models," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2241–2248.

[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[27] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.

[28] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.

[29] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in European Conference on Computer Vision. Springer, 2014, pp. 346–361.

[30] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, "Scalable object detection using deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2147–2154.

[31] P. O. Pinheiro, R. Collobert, and P. Dollar, "Learning to segment object candidates," in Advances in Neural Information Processing Systems, 2015, pp. 1990–1998.

[32] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.

[33] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," arXiv preprint, 2017.

[34] ——, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.

[35] J. Chen, Z. Liu, H. Wang, A. Nunez, and Z. Han, "Automatic defect detection of fasteners on the catenary support device using deep convolutional neural network," IEEE Transactions on Instrumentation and Measurement, vol. 67, no. 2, pp. 257–269, 2017.

[36] G. Kang, S. Gao, L. Yu, and D. Zhang, "Deep architecture for high-speed railway insulator surface defect detection: Denoising autoencoder with multitask learning," IEEE Transactions on Instrumentation and Measurement, 2018.

[37] J. Zhong, Z. Liu, Z. Han, Y. Han, and W. Zhang, "A CNN-based defect inspection method for catenary split pins in high-speed railway," IEEE Transactions on Instrumentation and Measurement, 2018.

[38] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.

[39] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[40] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama et al., "Speed/accuracy trade-offs for modern convolutional object detectors," in IEEE CVPR, vol. 4, 2017.

[41] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.

[42] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision. Springer, 2014, pp. 818–833.

[43] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[44] L. Zhang, L. Lin, X. Liang, and K. He, "Is Faster R-CNN doing well for pedestrian detection?" in European Conference on Computer Vision. Springer, 2016, pp. 443–457.

[45] C. Zhou and J. Yuan, "Multi-label learning of part detectors for heavily occluded pedestrian detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3486–3495.

[46] W. Ouyang, H. Zhou, H. Li, Q. Li, J. Yan, and X. Wang, "Jointly learning deep features, deformable parts, occlusion and classification for pedestrian detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, pp. 1874–1887, 2018.

[47] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.

[48] Tzutalin, "LabelImg," https://github.com/tzutalin/labelImg, 2015.

[49] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2012.

Qian Xie is currently working towards the Ph.D. degree at Nanjing University of Aeronautics and Astronautics (NUAA), China. He received his Bachelor degree in Computer-Aided Design from NUAA in 2015. His research interests include computer vision, robotics and machine learning.

Dawei Li is currently working towards the Ph.D. degree at Nanjing University of Aeronautics and Astronautics (NUAA), China. He received his Bachelor degree in Mechanical Design and Manufacturing and Automation from Anhui University of Technology (AHUT) in 2015. His research interests include image processing and machine learning.

Zhenghao Yu has been a graduate student at Nanjing University of Aeronautics and Astronautics (NUAA), China, since 2017. He received his Bachelor degree in Electrical Engineering and Automation from Henan Polytechnic University in 2016. His research is mainly about computer vision and machine learning.

Jun Zhou is currently working towards the Master's degree at Nanjing University of Aeronautics and Astronautics (NUAA), China. He received his Bachelor degree in Mechanical Design, Manufacture and Automation from Nanjing Agricultural University (NJAU) in 2018. His research interests include computer vision and machine learning.

Jun Wang is currently a professor at Nanjing University of Aeronautics and Astronautics (NUAA), China. He received his Bachelor and PhD degrees in Computer-Aided Design from NUAA in 2002 and 2007, respectively. From 2008 to 2010, he conducted research as a postdoctoral scholar at the University of California and the University of Wisconsin. From 2010 to 2013, he worked as a senior research engineer at Leica Geosystems, USA. In 2013, he paid an academic visit to the Department of Mathematics at Harvard University. His research interests include geometry processing and geometric modeling.