

Neurocomputing 214 (2016) 758–766


Traffic sign detection and recognition using fully convolutional network guided proposals

Yingying Zhu, Chengquan Zhang, Duoyou Zhou, Xinggang Wang, Xiang Bai, Wenyu Liu

School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China

Article info

Article history:
Received 2 February 2016
Received in revised form 11 June 2016
Accepted 3 July 2016
Communicated by Wang Qi
Available online 20 July 2016

Keywords:
Traffic sign recognition
Fully convolutional network
Object proposal
Object detection

http://dx.doi.org/10.1016/j.neucom.2016.07.009
0925-2312/© 2016 Elsevier B.V. All rights reserved.

Corresponding author. E-mail address: [email protected] (W. Liu).

Abstract

Detecting and recognizing traffic signs is a hot topic in the field of computer vision with many applications, e.g., safe driving, path planning and robot navigation. We propose a novel framework with two deep learning components: fully convolutional network (FCN) guided traffic sign proposals and a deep convolutional neural network (CNN) for object classification. Our core idea is to use the CNN to classify traffic sign proposals, yielding fast and accurate traffic sign detection and recognition. Due to the complexity of traffic scenes, we improve the state-of-the-art object proposal method, EdgeBox, by incorporating a trained FCN. The FCN guided object proposals are more discriminative candidates, which help to make the whole detection system fast and accurate. In the experiments, we have evaluated the proposed method on a publicly available traffic sign benchmark, the Swedish Traffic Signs Dataset (STSD), and achieved state-of-the-art results.

© 2016 Elsevier B.V. All rights reserved.

1. Introduction

Traffic scene analysis is a very important topic in computer vision and intelligent systems [1–6]. Traffic signs are designed to inform drivers of the current condition and other important information of the road. They are rigid, simple shapes with eye-catching colors, and the information they carry is easy to understand. However, accidents may still occur when drivers do not notice a traffic sign in time. Hence, it is very important to design an automatic real-time driver assistance system to detect and recognize traffic signs.

Although the research community and industry in the field of computer vision have achieved significant progress in traffic sign detection and recognition in the past decades, two main difficulties remain. One is poor image quality due to low resolution, motion blur and noise. The other is uncontrolled environmental factors, including weather conditions, complex background, variable illumination, sign color fading, etc. Fig. 1 shows some difficult examples. These make traffic sign detection and recognition still an open problem.

In order to tackle these problems, we propose a novel and efficient method to detect and recognize traffic signs. Traffic signs have a distinctive property: they almost always appear on the two sides of the road being traveled. Taking advantage of this property, the proposed method extracts traffic sign proposals under the guidance of a fully convolutional network, unlike traditional traffic sign detection and recognition methods, which usually start from color segmentation [7–11], shape detection [12–15] or sliding-window scanning [16,17] to find traffic sign regions. This effectively reduces the search range for traffic signs, which in turn reduces the number of proposals and improves efficiency.

The pipeline of the proposed method for traffic sign detection is illustrated in Fig. 2 and is mainly composed of two parts. One is the proposal extraction stage guided by a fully convolutional network [18] and EdgeBox [19]. The other is the traffic sign classification stage using a convolutional neural network [20]. Given a scene image, coarse regions of traffic signs are first generated by the fully convolutional network. Then, traffic sign proposals are extracted by EdgeBox from the coarse sign regions. Finally, traffic signs are identified and false positives are eliminated by a trained convolutional neural network classifier, and the optimal bounding boxes are retained by non-maximum suppression.

Different from previous methods on both traffic sign detection and general object detection, a novel FCN guided object proposal method is proposed. Since pixel-level annotation is unavailable for traffic sign detection, we propose to train the FCN using bounding-box-level annotation, which is a typically weak form of supervision for a semantic segmentation method like FCN.

The proposed method is validated on a publicly available database, the Swedish Traffic Signs Dataset (STSD) [21]. The experimental results demonstrate that the proposed method has a much higher detection rate than other methods while producing fewer false positives. In addition, the proposed method is not time consuming and can potentially be applied in real-time applications by utilizing the computational power of GPU devices.


Fig. 1. Some difficult examples for traffic sign detection and recognition.

Fig. 2. The pipeline of the proposed method.


Object detection using proposals is popular in the recent literature, especially after the R-CNN [22] series of works achieved state-of-the-art object detection performance on PASCAL VOC [23] and the ILSVRC detection task [24]. The proposed method follows R-CNN, but there are at least three major differences: (1) We extend R-CNN to a new application, the challenging traffic sign detection problem. (2) We adopt a much faster object proposal method, EdgeBox, rather than the selective search method [25], which is more accurate but slower. (3) We propose a novel FCN guided object proposal that is both fast and accurate.

In summary, the main contributions of this work are threefold. First, this paper proposes a new framework for traffic sign detection and recognition based on object proposals, which works in a coarse-to-fine manner. Second, guided by the heat maps of traffic signs generated by fully convolutional networks, the search area for signs is drastically narrowed. Third, an efficient multi-class CNN model is trained for sign recognition with a bootstrapping strategy. The proposed method achieves state-of-the-art performance on the traffic sign benchmark.

The remainder of this paper is organized as follows: Section 2 presents recent works related to the proposed method. Section 3 details the proposed method, including coarse sign region segmentation, proposal extraction and sign classification. In Section 4, the experimental results and discussions are presented. Finally, conclusions and future work are given in Section 5.

2. Related work

Over the last decade, research in traffic sign detection and recognition has grown rapidly. A large number of novel ideas and effective methods have been proposed [7–10,12,13,16,17,21,26–40]. Usually, the detection part hunts for potential regions of traffic signs and the recognition part determines the category of each traffic sign.



Previous traffic sign detection methods can be divided into three main categories: color-based methods, shape-based methods and sliding-window based methods. Color-based methods are commonly used because traffic signs are usually red, yellow or blue. Since these methods are sensitive to changes in illumination, several color spaces are often used, such as normalized RGB [7] and HSI [8,9]. In [10], red, blue and yellow pixels are enhanced in the given color channels to extract the corresponding color blobs. These methods can improve performance to a certain extent, but depend on a set of thresholds prescribed by the designer. Moreover, the color of traffic signs is likely to fade. Shape-based methods exploit the shape of traffic signs, such as circle, triangle and square, and are more robust to changes in illumination. The Hough Transform is commonly used to detect traffic signs such as circles and triangles [12]. However, these methods are time-consuming. Motivated by this, Loy and Barnes [13] proposed a radial symmetry detector to decrease the processing time. These methods are, however, often sensitive to noise in cluttered environments, and parts of traffic signs may be occluded, making detection harder. Making use of the advantages of the above methods, some feature-fusing methods combine color and shape to detect traffic signs [27,28,41]. Unfortunately, all the methods above are likely to miss traffic signs for various reasons. To solve this problem, sliding-window based methods using HOG [16] and Viola–Jones [17] have been proposed to detect traffic signs and achieve good results. These methods consider regions over the whole image, so the number of candidates is huge. Given the real-time requirement of traffic sign detection and recognition, these methods are far from practical use.

There are a number of popular traffic sign recognition methods using different classifiers, such as Nearest Neighbor [10], Random Forests, Neural Networks [30–32], Support Vector Machines [9,33,40] and CNN [38]. In [7], two neural networks were trained to recognize the detected windows. In [9], candidate blobs are first detected by thresholding in HSI color space; the detected blobs are then classified by linear SVMs according to shape, and the final recognition is carried out by a different SVM for every class. In [38], a biologically-inspired, multilayer architecture based on convolutional networks is proposed for traffic sign recognition.

The proposed method is inspired by object proposal methods. A number of recent works have been devoted to producing a set of reliable bounding boxes that may contain objects, namely object proposals. Typical object proposal methods include EdgeBox [19], BING [42], Geodesic Object Proposals (GOP) [43] and selective search [25]. Compared to sliding-window search, they largely narrow the search range while improving the speed and accuracy of object detection. Therefore, a series of object detection methods [22,44,45] based on object proposals have been proposed and have achieved advanced performance. Object candidates are obtained by the proposal methods, and rich or efficient features are extracted to further classify each proposal [46–49]. In the work of R-CNN [22], Girshick et al. use selective search to obtain potential bounding boxes of objects and extract discriminative CNN features from them to train a set of class-specific linear SVMs. DeepID-Net [44] refines the results of R-CNN with a deformable deep convolutional neural network, which can model the deformation of object parts with geometric constraints and penalties.

3. Approach

In this section, we present the details of the detection pipeline illustrated in Fig. 2. It includes two main components: the novel proposal extraction component and the CNN classification component, with training and testing details.

3.1. Proposal extraction

Object proposal methods have been widely used in object detection in recent works [22,44,45]. Compared to traditional sliding-window object detection, object proposal methods provide a smaller number of candidate bounding boxes with high recall. Object detection is accomplished by classifying object proposals rather than all sliding windows, which makes object detection faster and makes it possible to utilize more accurate but slower classifiers, e.g., deep CNNs. The proposed detection framework is based on object proposals but differs from previous object proposal methods. Fig. 3 shows the proposal extraction process. Different from generic objects, traffic signs always appear at some specific positions in traffic scenarios. As shown in the top row of Fig. 3, traffic signs always appear on the two sides of the road. Therefore, rather than extracting object proposals from the whole image, we use the fully convolutional network to find coarse sign regions, which greatly narrows the search range for traffic signs.

3.1.1. Coarse sign region segmentation using fully convolutional network

Identifying regions of interest (ROI) is an effective strategy in object detection. Previous works usually adopt coarse-to-fine strategies, which help to significantly speed up the detection process [50,51]. However, identifying ROIs is non-trivial. Previous methods usually solve this problem by training a fast classifier using weak features. In this paper, we propose a new solution: training a fully convolutional network (FCN) using bounding-box-level supervision. Though this kind of supervision is weak and noisy, since a bounding box contains non-object regions, it is suitable for coarse object detection.
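As a concrete illustration of this weak supervision, the sketch below rasterizes bounding-box annotations into a binary pixel label map for FCN training. It is a minimal sketch in Python/NumPy; the function name and the (x1, y1, x2, y2) box format are our own assumptions, not from the paper.

```python
import numpy as np

def boxes_to_mask(image_shape, boxes):
    """Rasterize bounding-box annotations into a binary label map.

    Pixels inside any ground-truth box are marked positive (1),
    all remaining pixels negative (0), as described in Section 3.1.1.
    """
    h, w = image_shape
    mask = np.zeros((h, w), dtype=np.uint8)
    for (x1, y1, x2, y2) in boxes:  # boxes in pixel coordinates
        mask[max(0, y1):min(h, y2), max(0, x1):min(w, x2)] = 1
    return mask

# Example: a 500x500 training patch with two annotated signs
labels = boxes_to_mask((500, 500), [(120, 80, 160, 120), (300, 210, 340, 250)])
```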

Recently, FCN has achieved advanced performance on object segmentation and edge detection [18,52,53]. FCN is composed of convolutional layers, rectified linear units and sampling layers and has no fully connected layers, which allows the network to efficiently make pixel-level predictions on images of arbitrary size. The feature maps from different convolutional layers in FCN can be easily fused through up-sampling layers to generate a rich and hierarchical representation.

We adopt the holistically-nested edge detection network used in [52] to generate coarse sign regions. As shown in Fig. 4, feature maps from each convolutional stage are fused to generate rich and hierarchical features by a deconvolutional layer, which brings the different feature maps to the same size. The hierarchical feature at each position (i, j) of the feature fusion layer can therefore be formulated as:

$$F(i,j) = \sum_{k=1}^{N} f_k\left(i/s_k,\; j/s_k\right) \qquad (1)$$

where F(i,j) is the feature representation at position (i,j), f_k is the feature map from the k-th layer, s_k is the stride (upsampling factor) of the corresponding deconvolutional layer and N is the number of convolutional feature maps. The fused features are then fed into a 1×1 convolutional layer, which plays the role of the fully connected layers. During training, the pixels inside the ground-truth bounding boxes are regarded as positive and all other pixels as negative.
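A minimal sketch of this fusion step in PyTorch is given below. Eq. (1) sums the stage feature maps directly; because the stages of a VGG-style backbone have different channel counts, the sketch adds per-stage 1×1 projections before summing, which is our assumption rather than something the paper specifies, and bilinear interpolation stands in for the deconvolutional upsampling.

```python
import torch.nn as nn
import torch.nn.functional as F

class FuseAndPredict(nn.Module):
    """Sketch of Eq. (1): upsample N stage feature maps to a common
    resolution, sum them, then apply a 1x1 convolution that plays
    the role of the fully connected layer for per-pixel prediction."""

    def __init__(self, stage_channels, fused_channels=16):
        super().__init__()
        # Per-stage 1x1 projections so maps of different depth can be
        # summed (an assumption; Eq. (1) sums the maps directly).
        self.proj = nn.ModuleList(
            nn.Conv2d(c, fused_channels, kernel_size=1) for c in stage_channels)
        self.score = nn.Conv2d(fused_channels, 1, kernel_size=1)

    def forward(self, feature_maps, out_size):
        fused = 0
        for proj, f_k in zip(self.proj, feature_maps):
            # f_k(i/s_k, j/s_k): upsample each stage map to out_size
            fused = fused + F.interpolate(proj(f_k), size=out_size,
                                          mode='bilinear', align_corners=False)
        return self.score(fused)  # per-pixel sign/background score
```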

3.1.2. Traffic sign proposals

Object proposal methods [19,25,42,43] are widely used in object detection; they generate a set of reliable candidate bounding boxes that are likely to contain objects. Among these works, EdgeBox provides candidate bounding boxes with high recall while consuming little time. Therefore, we utilize EdgeBox to generate sign proposals for traffic sign detection. Since the proposal extraction is based on the coarse sign regions, the recall of traffic sign proposals reaches a high value with a relatively small number of proposals, compared to the original EdgeBox.


Fig. 3. The process of proposal extraction. The top row shows the original scene images. The second row shows heat maps generated by the fully convolutional network. The third row shows coarse sign regions. The bottom row shows sign proposals extracted by EdgeBox based on the coarse regions. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


As shown in the bottom row of Fig. 3, the red bounding boxes are the traffic sign proposals extracted by EdgeBox.

In summary, the proposed FCN guided object proposal has two advantages: (1) It improves detection speed by removing most background regions. (2) It improves detection accuracy by suppressing false positives, since it is discriminatively and deeply trained. The strategy of FCN guided object proposals is general and can be applied to other object detection tasks.

3.2. Traffic sign classification

After the traffic sign proposals are extracted, we perform traffic sign classification to differentiate the different sign classes from each other and from the background. Inspired by the excellent performance of image classification using deep CNNs [20], we utilize a deep CNN to perform the traffic sign classification task. The architecture, training procedure and testing procedure are as follows.

Architecture: In our network, each convolutional layer is followed by a batch normalization layer (which significantly improves convergence speed), a ReLU layer and a max-pooling layer with stride 2.


Fig. 4. The architecture of the fully convolutional network. Feature maps from different convolutional layers are fused after the deconvolutional layers, which can be replaced by a 1×1 convolutional layer and an upsampling layer.

Table 1. Settings of the CNN used for traffic sign classification.

Layers  Settings
Conv-1  ks=5×5, ps=2, nm=64, ss=1
Conv-2  ks=5×5, ps=2, nm=256, ss=1
Conv-3  ks=3×3, ps=1, nm=384, ss=1
Conv-4  ks=3×3, ps=1, nm=512, ss=1
FC-5    nn=4096; DropOut
FC-6    nn=4096; DropOut
FC-7    nn=N


The input image size is 64×64, and the details of the CNN settings are listed in Table 1, where ks is the kernel size, ps the padding size, nm the number of feature maps, ss the stride of each convolutional layer and nn the output size of each fully connected layer. Note that this CNN is relatively small compared with the popular AlexNet [20]. This is because the classification task is easier: the FCN guided object proposals already provide accurate object localization and remove noisy background. A small network is also fast at detection time.
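For illustration, the Table 1 configuration can be written down as follows. PyTorch is used only as convenient notation (the original work predates it); the pooling kernel size of 2 and the 3-channel 64×64 input are assumptions consistent with the text.

```python
import torch.nn as nn

def make_sign_classifier(num_classes):
    """Sketch of the Table 1 network: each conv layer is followed by
    batch norm, ReLU and 2-strided max pooling; input is a 64x64 RGB patch."""
    def block(cin, cout, ks, pad):
        return [nn.Conv2d(cin, cout, ks, stride=1, padding=pad),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2, stride=2)]
    return nn.Sequential(
        *block(3, 64, 5, 2),     # Conv-1: 64x64 -> 32x32
        *block(64, 256, 5, 2),   # Conv-2: 32x32 -> 16x16
        *block(256, 384, 3, 1),  # Conv-3: 16x16 -> 8x8
        *block(384, 512, 3, 1),  # Conv-4: 8x8 -> 4x4
        nn.Flatten(),
        nn.Linear(512 * 4 * 4, 4096), nn.ReLU(inplace=True), nn.Dropout(),  # FC-5
        nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(),         # FC-6
        nn.Linear(4096, num_classes))  # FC-7: N outputs (sign classes + background)
```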

Bootstrapping training: In order to improve the discriminative ability of the CNN, we adopt the bootstrapping method to mine hard negatives for training. Owing to the complexity and diversity of scene images, traffic signs are easily affected by factors including illumination, occlusion and noise. In practice, we use all positive proposals and randomly select the same number of negative proposals from the training images in the first round of training. After the network converges, we use the model to find wrongly classified samples and add them to the training set to further optimize the network.
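The bootstrapping loop can be sketched as below. All names (train_fn, model.predict, the number of rounds) are illustrative placeholders; the paper only specifies the overall strategy of retraining on wrongly classified samples.

```python
import random

def bootstrap(model, train_fn, proposals, labels, rounds=2):
    """Hard-negative bootstrapping sketch (names are illustrative).

    Round 1 trains on all positives plus an equal number of random
    negatives; later rounds add misclassified samples back in."""
    pos = [(x, y) for x, y in zip(proposals, labels) if y > 0]
    neg = [(x, 0) for x, y in zip(proposals, labels) if y == 0]
    train_set = pos + random.sample(neg, min(len(pos), len(neg)))
    for _ in range(rounds):
        train_fn(model, train_set)              # train to convergence
        hard = [(x, y) for x, y in (pos + neg)
                if model.predict(x) != y]       # wrongly classified samples
        train_set += hard                       # mine them into the training set
    return model
```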

Non-maximum suppression: At test time, all sign proposals are assigned scores by the trained CNN model. False proposals are eliminated according to their low scores, and the locally optimal bounding boxes of traffic signs are retained via standard non-maximum suppression as the final output of the proposed traffic sign detection system.
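Standard greedy non-maximum suppression, as used here, can be implemented as follows; the 0.5 overlap threshold is an assumption, since the paper does not report its NMS threshold.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes.

    Keeps the highest-scoring box, removes boxes overlapping it by more
    than iou_thresh, and repeats on the remainder."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop suppressed boxes
    return keep
```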

4. Experiments

4.1. Dataset and evaluation protocol

The proposed method is evaluated on a publicly available database: the Swedish Traffic Signs Dataset (STSD) [21]. STSD consists of two sets (Set1 and Set2) that contain more than 20,000 frames in total, recorded from Swedish highways and city roads; every fifth frame has been manually labeled. Part0 of each set is annotated (roughly 2000 images) while Part1–Part4 are not. Some examples from STSD can be seen in Fig. 5. The annotations for each image give the status, location, type and name of each traffic sign. The sign status tells us whether a traffic sign is on the road being traveled or on a side road, and whether it is visible, occluded or blurred. There are 4 types of traffic signs: information, mandatory, prohibitory and warning. These types can be subdivided into sign names, such as pedestrian crossing, pass right side, no stopping no standing, 50 sign, priority road, give way, 70 sign, 80 sign, 100 sign, no parking and others.

The evaluation was conducted using Part0 of Set1 as the training set and Part0 of Set2 as the testing set. A detection is considered a true positive if the IoU (intersection over union) between it and the annotated bounding box is more than 0.5. In order to compare with the results reported in [21], only visible traffic signs whose sizes are larger than 50 pixels are considered, because a traffic sign at a distance of about 30 m on a motorway corresponds to a size of about 50 pixels in the images.
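The IoU criterion can be computed with a small helper like the one below (the (x1, y1, x2, y2) box format is assumed):

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes; a detection
    counts as a true positive when iou(detection, ground_truth) > 0.5."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```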

4.2. Implementation details

Two networks are used in the proposed method: one is the fully convolutional network trained at pixel level to predict the coarse sign regions, the other is the CNN that performs the final classification of sign proposals. Both networks are fine-tuned and trained on STSD.

FCN: The architecture of the fully convolutional network used to generate the coarse sign regions is shown in Fig. 4. All 5 convolutional stages are derived from the VGG-16 model, and the deconvolutional layers are replaced by a 1×1 convolutional layer and an up-sampling layer to bring the different feature maps to the same size. The final pixel-level prediction is implemented by a 1×1 convolutional layer on the fused feature maps. Although FCN is insensitive to image size, we randomly crop 50,000 square patches with sizes from 600×600 to 800×800 from the training images and resize them to 500×500.



Fig. 5. Examples from the dataset.

Table 2. Time cost for each stage.

Stages                             Time cost (s)
Coarse sign region segmentation    0.30
Proposal extraction                0.05
Final classification               0.10
Total                              0.45

Fig. 6. Detection rate for different maximum numbers of EdgeBox boxes, with and without FCN guidance (curves: edgebox and FCN+edgebox; x-axis: maximum number of boxes, log scale; y-axis: detection rate).


Proposal generation: Image patches are fed to the FCN to generate the heat maps shown in the second row of Fig. 3. Gaussian smoothing and binarization are then applied to find the coarse sign regions. Finally, EdgeBox with default parameters, except the maximum number of boxes set to 50, is used to extract sign proposals from the coarse sign regions, as illustrated in the bottom row of Fig. 3.
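A possible implementation of this post-processing, assuming OpenCV and a heat map normalized to [0, 1], is sketched below. The smoothing kernel size and binarization threshold are illustrative, since the paper does not report them, and run_edgebox is a hypothetical wrapper around an EdgeBox implementation such as the one in opencv-contrib.

```python
import cv2
import numpy as np

def coarse_regions(heat_map, blur_ks=15, thresh=0.5):
    """Turn an FCN heat map (float32 in [0, 1]) into coarse sign regions
    via Gaussian smoothing and binarization; kernel size and threshold
    are illustrative values, not reported in the paper."""
    smoothed = cv2.GaussianBlur(heat_map, (blur_ks, blur_ks), 0)
    binary = (smoothed > thresh).astype(np.uint8)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    # Each stats row holds x, y, width, height, area; label 0 is background.
    return [tuple(stats[i, :4]) for i in range(1, n)]

# Proposals would then be extracted per region with EdgeBox, e.g.:
#   for (x, y, w, h) in coarse_regions(heat):
#       run_edgebox(image[y:y + h, x:x + w], max_boxes=50)  # hypothetical helper
```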

CNN: The configuration of the CNN used for the final traffic sign classification is shown in Table 1; it is a typical convolutional neural network with 4 convolutional layers and 3 fully connected layers. Traffic sign proposals with IoU over 0.5 are regarded as positive samples, the others as negative. Since the numbers of proposals of the different traffic sign classes are unbalanced, we augment a large training set based on the sign proposals from the training images: we randomly crop several image regions around each ground truth with IoU over 0.5, and we also add Gaussian blur to the image regions with a random value (3–5). Finally, each image region is resized to 64×64. After data augmentation, around 30K samples per sign class are selected, while the negative samples are randomly selected from the background. We use Stochastic Gradient Descent to train the model with the following parameters: mini-batch size 128, base learning rate 0.01, momentum 0.9 and weight decay 0.0002. With the bootstrapping method, the proposed model achieves around 99.7% accuracy on the validation set.
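The region augmentation described above might look like the following sketch. The jitter range is our assumption, chosen to keep the overlap with the ground truth above 0.5, and the paper's "random value (3–5)" for Gaussian blur is interpreted here as the kernel size.

```python
import random
import cv2

def augment_region(image, box, out_size=64):
    """Illustrative augmentation (exact parameters are assumptions):
    jitter a ground-truth box by small shifts so overlap stays above 0.5,
    apply Gaussian blur with a random kernel, then resize to 64x64."""
    h_img, w_img = image.shape[:2]
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    dx = random.randint(-w // 4, w // 4)  # quarter-size shift keeps IoU ~0.6
    dy = random.randint(-h // 4, h // 4)
    x1n, y1n = max(0, x1 + dx), max(0, y1 + dy)
    x2n, y2n = min(w_img, x2 + dx), min(h_img, y2 + dy)
    crop = image[y1n:y2n, x1n:x2n]
    k = random.choice([3, 5])  # "random value (3-5)" read as odd kernel size
    crop = cv2.GaussianBlur(crop, (k, k), 0)
    return cv2.resize(crop, (out_size, out_size))
```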

All stages of the proposed method are executed on a workstation with the following configuration: 8-core [email protected] GHz CPU, GTX-TitanX GPU with 12 GB memory, 64 GB RAM. The time cost of each stage is listed in Table 2; it is dominated by the GPU operations. In addition, before generating the coarse sign region heat map, we resize the height of each image to 500 pixels while keeping the aspect ratio.


Table 3. Performance on STSD for the methods presented in [21,54], R-CNN and the proposed method (Precision % / Recall %).

Sign name                  [21]            [54]            R-CNN           Proposed method
PEDESTRIAN CROSSING        96.03 / 91.77   98.52 / 93.45   87.9 / 87.2     100 / 95.2
PASS RIGHT SIDE            100 / 95.33     100 / 97.53     93.8 / 93.8     95.3 / 93.8
NO STOPPING NO STANDING    97.14 / 77.27   99.2 / 81.46    66.8 / 71.7     100 / 75
50 SIGN                    100 / 76.12     100 / 80.56     100 / 100       100 / 100
PRIORITY ROAD              98.66 / 47.76   97.89 / 79.68   95.7 / 97.8     100 / 98.9
GIVE WAY                   59.26 / 47.76   71.5 / 52.39    79.4 / 90       96.7 / 96.7
70 SIGN                    –               –               92.6 / 86.2     100 / 100
80 SIGN                    –               –               100 / 77.3      94.4 / 77.3
100 SIGN                   –               –               100 / 100       90.5 / 100
NO PARKING                 –               –               96.3 / 68.4     100 / 92.1
Average (first 6 classes)  91.84 / 77.08   94.52 / 80.85   90.08 / 87.27   98.67 / 93.27
Average (all)              –               –               91.25 / 87.24   97.69 / 92.9

Table 4. Time cost for the method presented in [54], R-CNN and the proposed method.

Methods           Time cost (s)
[54]              0.05–0.5
R-CNN             15.7
Proposed method   0.45


4.3. Experimental results and analysis

4.3.1. Effect of FCN on the number of proposals

The proposed method utilizes the FCN to extract coarse sign regions in order to reduce the number of EdgeBox proposals while ensuring that the detection rate is not reduced. Fig. 6 shows the detection rate under different conditions. Here, edgebox denotes the detection rate for different maximum numbers of boxes of the original EdgeBox.

Fig. 7. (a) Some examples of correctly detected and recognized traffic signs. (b) False positives of traffic signs.

FCN+edgebox denotes the detection rate for different maximum numbers of boxes extracted by EdgeBox from every coarse sign region found by the FCN. As can be seen, the detection rate reaches 100% when the maximum number of boxes extracted from the coarse sign regions is only on the order of 10^1, whereas on the order of 10^3 boxes are needed when simply using the original EdgeBox. The large reduction in the number of proposals also reduces the number of negative training samples, ensuring the balance of positive and negative samples and thus improving the accuracy of the CNN classification.

4.3.2. Traffic sign detection performance

The performance on STSD of the methods presented in [21,54], R-CNN and the proposed method is summarized in Table 3. The proposed method achieves an average precision of 98.67% and an average recall of 93.27%. Compared to the methods in [21,54], the recall of the proposed method is greatly improved and its precision is also somewhat improved.




In order to show the effect of the FCN, we also test the R-CNN method, which uses the original EdgeBox to extract proposals, on this dataset; see Table 3. As can be seen, the proposed method outperforms R-CNN in both precision and recall by more than 5%. Furthermore, a comparison of time cost is given in Table 4. Both the method in [54] and the proposed method meet the requirement for real-time detection and recognition, consuming less than 0.5 s. The R-CNN method, however, is time-consuming due to the large number of proposals to be handled.

In addition to the traffic signs used in [21], we also detect some other sign classes that have relatively large numbers of instances. As can be seen, the proposed method works well on different traffic signs. Compared to the other classes, the recall of no stopping no standing and 80 sign is relatively low. This is because the number of samples of these two classes is relatively small. Moreover, no stopping no standing is very similar to no parking, and 80 sign is very similar to 50 sign, 70 sign and so on.

Fig. 7(a) shows some examples of correctly detected and recognized traffic signs. There are two main causes of false positives. One is that non-maximum suppression may eliminate the correct box when its score is lower than that of an incorrect one; see the top of Fig. 7(b). The other is the high similarity between background and traffic signs; see the bottom of Fig. 7(b).

5. Conclusions

In this paper, we propose a novel and efficient method to detect and recognize traffic signs. The main contributions of this paper are the following: (1) We propose a new framework for traffic sign detection and recognition based on proposals guided by a fully convolutional network, which largely reduces the search area for traffic signs while preserving the detection rate. (2) We extend R-CNN to a new application, the challenging traffic sign detection and recognition problem, by using a much faster object proposal method, EdgeBox. (3) The large reduction in the number of proposals not only greatly reduces the time cost, but also eliminates most negative training samples, ensuring the balance of positive and negative samples and improving the accuracy of the CNN classifier. Experimental results on STSD show that the proposed method achieves state-of-the-art results. In the future, we will study learning an end-to-end network to generate the proposed FCN guided proposals and developing a real-time traffic sign system based on the algorithms in this paper. Another future task is to study reading the text within traffic signs [55,56] for automatic driving systems.

References

[1] Z. Zhang, T. Tan, K. Huang, Y. Wang, Practical camera calibration from moving objects for traffic scene surveillance, IEEE Trans. Circuits Syst. Video Technol. 23 (3) (2013) 518–533.
[2] Z. Zhang, K. Huang, Y. Wang, M. Li, View independent object classification by exploring scene consistency information for traffic scene surveillance, Neurocomputing 99 (2013) 250–260.
[3] C. Yao, X. Bai, W. Liu, L. Latecki, Human detection using learned part alphabet and pose dictionary, in: Proceedings of ECCV, 2014.
[4] C. Hu, X. Bai, L. Qi, X. Wang, G. Xue, L. Mei, Learning discriminative pattern for real-time car brand recognition, IEEE Trans. Intell. Transp. Syst. 16 (6) (2015) 3170–3181.
[5] Y. Xia, W. Xu, L. Zhang, X. Shi, K. Mao, Integrating 3d structure into traffic scene understanding with rgb-d data, Neurocomputing 151 (2015) 700–709.
[6] Q. Ling, J. Yan, F. Li, Y. Zhang, A background modeling and foreground segmentation approach based on the feedback of moving objects in traffic surveillance systems, Neurocomputing 133 (2014) 32–45.
[7] A. De La Escalera, L.E. Moreno, M.A. Salichs, J.M. Armingol, Road traffic sign detection and classification, IEEE Trans. Ind. Electron. 44 (6) (1997) 848–859.
[8] A. De la Escalera, J.M. Armingol, M. Mata, Traffic sign recognition and analysis for intelligent vehicles, Image Vis. Comput. 21 (3) (2003) 247–258.
[9] S. Maldonado-Bascón, S. Lafuente-Arroyo, P. Gil-Jimenez, H. Gómez-Moreno, F. López-Ferreras, Road-sign detection and recognition based on support vector machines, IEEE Trans. Intell. Transp. Syst. 8 (2) (2007) 264–278.

[10] A. Ruta, Y. Li, X. Liu, Real-time traffic sign recognition from video by class-specific discriminative features, Pattern Recognit. 43 (1) (2010) 416–430.
[11] J. Lillo-Castellano, I. Mora-Jimenez, C. Figuera-Pozuelo, J. Rojo-Alvarez, Traffic sign segmentation and classification using statistical learning methods, Neurocomputing 153 (2015) 286–299.
[12] M.Á. García-Garrido, M.Á. Sotelo, E. Martín-Gorostiza, Fast road sign detection using Hough transform for assisted driving of road vehicles, in: Computer Aided Systems Theory – EUROCAST 2005, Springer, Las Palmas de Gran Canaria, Spain, 2005, pp. 543–548.
[13] G. Loy, N. Barnes, Fast shape-based road sign detection for a driver assistance system, in: Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2004), vol. 1, IEEE, Sendai, Japan, 2004, pp. 70–75.
[14] X. Bai, S. Bai, Z. Zhu, L. Latecki, 3d shape matching via two layer coding, IEEE Trans. Pattern Anal. Mach. Intell. 37 (12) (2015) 2361–2373.
[15] H. Li, F. Sun, L. Liu, L. Wang, A novel traffic sign detection method via color segmentation and robust shape matching, Neurocomputing 169 (2015) 77–88.
[16] I.M. Creusen, R.G. Wijnhoven, E. Herbschleb, P. De With, Color exploitation in hog-based traffic sign detection, in: Proceedings of ICIP, IEEE, Hong Kong, China, 2010, pp. 2669–2672.
[17] X. Baró, S. Escalera, J. Vitrià, O. Pujol, P. Radeva, Traffic sign recognition using evolutionary adaboost detection and forest-ecoc classification, IEEE Trans. Intell. Transp. Syst. 10 (1) (2009) 113–126.

[18] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of CVPR, 2015, pp. 3431–3440.
[19] C.L. Zitnick, P. Dollár, Edge boxes: locating object proposals from edges, in: Proceedings of ECCV, Springer, Zurich, Switzerland, 2014, pp. 391–405.
[20] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[21] F. Larsson, M. Felsberg, Using Fourier descriptors and spatial models for traffic sign recognition, in: Image Analysis, Springer, Ystad, Sweden, 2011, pp. 238–249.
[22] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of CVPR, IEEE, Columbus, OH, USA, 2014, pp. 580–587.
[23] M. Everingham, L. Van Gool, C.K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338.
[24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015) 211–252, http://dx.doi.org/10.1007/s11263-015-0816-y.
[25] J.R. Uijlings, K.E. Van de Sande, T. Gevers, A.W. Smeulders, Selective search for object recognition, Int. J. Comput. Vis. 104 (2) (2013) 154–171.

[26] Y. Zhu, C. Yao, X. Bai, Scene text detection and recognition: recent advances and future trends, Front. Comput. Sci. 10 (1) (2016) 19–36.
[27] G. Piccioli, E. De Micheli, P. Parodi, M. Campani, Robust method for road sign detection and recognition, Image Vis. Comput. 14 (3) (1996) 209–223.
[28] X.W. Gao, L. Podladchikova, D. Shaposhnikov, K. Hong, N. Shevtsova, Recognition of traffic signs based on their colour and shape features extracted using human vision models, J. Vis. Commun. Image Represent. 17 (4) (2006) 675–685.
[29] G. Xia, J. Delon, Y. Gousseau, Accurate junction detection and characterization in natural images, Int. J. Comput. Vis. 106 (1) (2014) 31–56.
[30] M.A. Garcia-Garrido, M.A. Sotelo, E. Martin-Gorostiza, Fast traffic sign detection and recognition under changing lighting conditions, in: 2006 IEEE Intelligent Transportation Systems Conference, ITSC'06, IEEE, Toronto, ON, Canada, 2006, pp. 811–816.
[31] D. Cireşan, U. Meier, J. Masci, J. Schmidhuber, A committee of neural networks for traffic sign classification, in: The 2011 International Joint Conference on Neural Networks (IJCNN), IEEE, San Jose, California, USA, 2011, pp. 1918–1921.
[32] D. Cireşan, U. Meier, J. Masci, J. Schmidhuber, Multi-column deep neural network for traffic sign classification, Neural Netw. 32 (2012) 333–338.
[33] F. Zaklouta, B. Stanciulescu, Real-time traffic sign recognition in three stages, Robot. Auton. Syst. 62 (1) (2014) 16–24.
[34] R. Timofte, K. Zimmermann, L. Van Gool, Multi-view traffic sign detection, recognition, and 3d localisation, Mach. Vis. Appl. 25 (3) (2014) 633–647.
[35] J. Miura, T. Kanda, Y. Shirai, An active vision system for real-time traffic sign recognition, in: 2000 IEEE Proceedings of Intelligent Transportation Systems, IEEE, Dearborn, MI, USA, 2000, pp. 52–57.
[36] C. Yao, X. Bai, W. Liu, A unified framework for multi-oriented text detection and recognition, IEEE Trans. Image Process. 23 (11) (2014) 4737–4749.
[37] C. Bahlmann, Y. Zhu, V. Ramesh, M. Pellkofer, T. Koehler, A system for traffic sign detection, tracking, and recognition using color, shape, and motion information, in: 2005 IEEE Proceedings of Intelligent Vehicles Symposium, IEEE, Las Vegas, NV, USA, 2005, pp. 255–260.
[38] P. Sermanet, Y. LeCun, Traffic sign recognition with multi-scale convolutional networks, in: The 2011 International Joint Conference on Neural Networks (IJCNN), IEEE, San Jose, California, USA, 2011, pp. 2809–2813.
[39] M. Mathias, R. Timofte, R. Benenson, L. Van Gool, Traffic sign recognition – how far are we from the solution? in: The 2013 International Joint Conference on Neural Networks (IJCNN), IEEE, Dallas, TX, USA, 2013, pp. 1–8.

[40] Z.-L. Sun, H. Wang, W.-S. Lau, G. Seet, D. Wang, Application of bw-elm model on traffic sign recognition, Neurocomputing 128 (2014) 153–159.
[41] Y. Zhou, X. Bai, W. Liu, L. Latecki, Similarity fusion for visual tracking, Int. J. Comput. Vis. (2016) 1–27, http://dx.doi.org/10.1007/s11263-015-0879-9.
[42] M.-M. Cheng, Z. Zhang, W.-Y. Lin, P. Torr, Bing: binarized normed gradients for objectness estimation at 300 fps, in: Proceedings of CVPR, IEEE, Columbus, OH, USA, 2014, pp. 3286–3293.
[43] P. Krähenbühl, V. Koltun, Geodesic object proposals, in: Proceedings of ECCV, Springer, Zurich, Switzerland, 2014, pp. 725–739.
[44] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C.-C. Loy, et al., Deepid-net: deformable deep convolutional neural networks for object detection, in: Proceedings of CVPR, 2015, pp. 2403–2412.
[45] R. Girshick, Fast r-cnn, in: Proceedings of ICCV, 2015, pp. 1440–1448.
[46] X. Wang, X. Duan, X. Bai, Deep sketch feature for cross-domain image retrieval, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2016.04.046.
[47] F. Hu, G.-S. Xia, Z. Wang, X. Huang, L. Zhang, H. Sun, Unsupervised feature learning via spectral clustering of multidimensional patches for remotely sensed scene classification, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 5 (8) (2015) 2015–2030.
[48] W. Yang, X. Yin, G.-S. Xia, Learning high-level features for satellite image classification with limited labeled samples, IEEE Trans. Geosci. Remote Sens. 53 (8) (2015) 4472–4482.
[49] F. Yang, G.-S. Xia, G. Liu, L. Zhang, X. Huang, Dynamic texture recognition by aggregating spatial and temporal features via ensemble svms, Neurocomputing 173 (2016) 1310–1321.

[50] R. Benenson, M. Mathias, R. Timofte, L. Van Gool, Pedestrian detection at 100 frames per second, in: Proceedings of CVPR, IEEE, Providence, RI, USA, 2012, pp. 2903–2910.
[51] M. Pedersoli, A. Vedaldi, J. Gonzalez, A coarse-to-fine approach for fast deformable object detection, in: Proceedings of CVPR, IEEE, Colorado Springs, CO, USA, 2011, pp. 1353–1360.
[52] S. Xie, Z. Tu, Holistically-nested edge detection, in: Proceedings of ICCV, 2015, pp. 1395–1403.
[53] W. Shen, K. Zhao, Y. Jiang, Y. Wang, Z. Zhang, X. Bai, Object skeleton extraction in natural images by fusing scale-associated deep side outputs, in: IEEE Conference on CVPR, IEEE, Las Vegas, NV, USA, 2016.
[54] T. Chen, S. Lu, Accurate and efficient traffic sign detection using discriminative adaboost and support vector regression, IEEE Trans. Veh. Technol. 65 (6) (2016) 4006–4015, http://dx.doi.org/10.1109/TVT.2015.2500275.
[55] X. Bai, C. Yao, W. Liu, Strokelets: a learned multi-scale mid-level representation for scene text recognition, IEEE Trans. Image Process. 25 (6) (2016) 2789–2802.
[56] B. Shi, X. Bai, C. Yao, Script identification in the wild via discriminative convolutional neural network, Pattern Recognit. 52 (2016) 448–458.

Yingying Zhu was born in 1988. She received the B.S. degree in electronics and information engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2011. She is currently a Ph.D. student with the School of Electronic Information and Communications, HUST. Her research areas mainly include text and traffic sign detection and recognition in natural images.

Chengquan Zhang was born in 1990. He received the B.S. degree in electronics and information engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2014. He is currently a Master student with the School of Electronic Information and Communications, HUST. His main research interests include image classification and scene text detection.

Duoyou Zhou was born in 1993. He received the B.S. degree from the College of Information Science & Technology, Hainan University (HNU), Haikou, China, in 2015. He is currently a Master student with the School of Electronic Information and Communications, HUST. His main research interests include object segmentation and scene text detection.

Xinggang Wang is an Assistant Professor with the School of Electronic Information and Communications of Huazhong University of Science and Technology. He received his Bachelor's degree in communication and information system and his Ph.D. degree in computer vision, both from Huazhong University of Science and Technology. From May 2010 to July 2011, he was with the Department of Computer and Information Science, Temple University, Philadelphia, PA, as a visiting scholar. From February 2013 to September 2013, he was with the University of California, Los Angeles, as a Visiting Graduate Researcher. He is a reviewer for IEEE Transactions on Cybernetics, Pattern Recognition, Computer Vision and Image Understanding, Neurocomputing, CVPR, ICCV, ECCV, etc. His research interests include computer vision and machine learning.

Xiang Bai is currently a Professor with the School of Electronic Information and Communications, Huazhong University of Science and Technology (HUST). He received the B.S., M.S., and Ph.D. degrees from HUST in 2003, 2005, and 2009, respectively. His research interests include computer vision and pattern recognition, specifically object recognition, shape analysis, scene text recognition and intelligent systems. He serves as an Editorial Board Member of Pattern Recognition Letters, Neurocomputing and Frontiers of Computer Science.

Wenyu Liu was born in 1963. He received the B.S. degree in Computer Science from Tsinghua University, Beijing, China, in 1986, and the M.S. and Ph.D. degrees, both in Electronics and Information Engineering, from Huazhong University of Science and Technology (HUST), Wuhan, China, in 1991 and 2001, respectively. He is now a professor and associate dean of the School of Electronic Information and Communications, HUST. His current research areas include computer vision, multimedia and sensor networks. He is a senior member of the IEEE.