

Spatial As Deep: Spatial CNN for Traffic Scene Understanding

Xingang Pan 1, Jianping Shi 2, Ping Luo 1, Xiaogang Wang 1, and Xiaoou Tang 1
1 The Chinese University of Hong Kong   2 SenseTime Group Limited

{px117, pluo, xtang}@ie.cuhk.edu.hk, [email protected], [email protected]

Abstract

Convolutional neural networks (CNNs) are usually built by stacking convolutional operations layer-by-layer. Although CNN has shown strong capability to extract semantics from raw pixels, its capacity to capture spatial relationships of pixels across rows and columns of an image is not fully explored. These relationships are important for learning semantic objects with strong shape priors but weak appearance coherence, such as traffic lanes, which are often occluded or not even painted on the road surface, as shown in Fig. 1 (a). In this paper, we propose Spatial CNN (SCNN), which generalizes traditional deep layer-by-layer convolutions to slice-by-slice convolutions within feature maps, thus enabling message passing between pixels across rows and columns in a layer. Such an SCNN is particularly suitable for long continuous shape structures or large objects with strong spatial relationships but few appearance clues, such as traffic lanes, poles, and walls. We apply SCNN on a newly released, very challenging traffic lane detection dataset and the Cityscapes dataset1. The results show that SCNN can learn the spatial relationship for structured output and significantly improve performance. We show that SCNN outperforms the recurrent neural network (RNN) based ReNet and MRF+CNN (MRFNet) on the lane detection dataset by 8.7% and 4.6% respectively. Moreover, our SCNN won the 1st place on the TuSimple Benchmark Lane Detection Challenge, with an accuracy of 96.53%.

Introduction

In recent years, autonomous driving has received much attention in both academia and industry. One of the most challenging tasks of autonomous driving is traffic scene understanding, which comprises computer vision tasks like lane detection and semantic segmentation. Lane detection helps to guide vehicles and could be used in driving assistance systems (Urmson et al. 2008), while semantic segmentation provides more detailed positions of surrounding objects like vehicles or pedestrians. In real applications, however, these tasks can be very challenging considering the many harsh scenarios, including bad weather conditions, dim or dazzling light, etc. Another challenge of traffic scene understanding is that in many cases, especially in lane detection, we need to tackle objects with strong structure priors but few

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Code is available at https://github.com/XingangPan/SCNN

Figure 1: Comparison between CNN and SCNN in (a) lane detection and (b) semantic segmentation. For each example, from left to right are: input image, output of CNN, output of SCNN. It can be seen that SCNN could better capture the long continuous shape prior of lane markings and poles and fix the disconnected parts in CNN.

appearance clues, like lane markings and poles, which have long continuous shapes and might be occluded. For instance, in the first example in Fig. 1 (a), the car on the right side fully occludes the rightmost lane marking.

Although CNN based methods (Krizhevsky, Sutskever, and Hinton 2012; Long, Shelhamer, and Darrell 2015) have pushed scene understanding to a new level thanks to their strong representation learning ability, they still do not perform well for objects that have long structured regions and could be occluded, such as the lane markings and poles shown in the red bounding boxes in Fig. 1. Humans, however, can easily infer their positions and fill in the occluded part from the context, i.e., the visible part.

To address this issue, we propose Spatial CNN (SCNN), a generalization of deep convolutional neural networks to a rich spatial level. In a layer-by-layer CNN, a convolution layer receives input from the former layer, applies convolution and nonlinear activation, and sends the result to the next layer. This process is done sequentially. Similarly, SCNN views rows or columns of feature maps as layers and applies convolution, nonlinear activation, and sum operations sequentially, which forms a deep neural network. In this way information can be propagated between neurons in the same layer. It is particularly useful for structured objects such as lanes, poles, or trucks with occlusions, since the spatial information can be reinforced via inter-layer propagation.


Figure 2: (a) Dataset examples for different scenarios. (b) Proportion of each scenario.

As shown in Fig. 1, in cases where the CNN output is discontinuous or messy, SCNN can well preserve the smoothness and continuity of lane markings and poles. In our experiments, SCNN significantly outperforms other RNN or MRF/CRF based methods, and also gives better results than the much deeper ResNet-101 (He et al. 2016).

Related Work. For lane detection, most existing algorithms are based on hand-crafted low-level features (Aly 2008; Son et al. 2015; Jung, Youn, and Sull 2016), limiting their capability to deal with harsh conditions. Only Huval et al. (2015) made a preliminary attempt to adopt deep learning in lane detection, but without a large and general dataset. For semantic segmentation, CNN based methods have become mainstream and achieved great success (Long, Shelhamer, and Darrell 2015; Chen et al. 2017).

There have been some other attempts to utilize spatial information in neural networks. Visin et al. (2015) and Bell et al. (2016) used recurrent neural networks to pass information along each row or column; thus in one RNN layer each pixel position can only receive information from the same row or column. Liang et al. (2016a; 2016b) proposed variants of LSTM to exploit contextual information in semantic object parsing, but such models are computationally expensive. Researchers have also attempted to combine CNN with graphical models like MRF or CRF, in which message passing is realized by convolution with large kernels (Liu et al. 2015; Tompson et al. 2014; Chu et al. 2016). There are three advantages of SCNN over these aforementioned methods: in SCNN, (1) the sequential message passing scheme is much more computationally efficient than traditional dense MRF/CRF, (2) the messages are propagated as residuals, making SCNN easy to train, and (3) SCNN is flexible and can be applied to any level of a deep neural network.

Spatial Convolutional Neural Network

Lane Detection Dataset

In this paper, we present a large scale challenging dataset for traffic lane detection. Despite the importance and difficulty of traffic lane detection, existing datasets are either too small or too simple, and a large public annotated benchmark is needed to compare different methods (Bar Hillel et al. 2014). KITTI (Fritsch, Kuhnl, and Geiger 2013) and CamVid (Brostow et al. 2008) contain pixel level annotations

for lane/lane markings, but have merely hundreds of images, which is too small for deep learning methods. The Caltech Lanes Dataset (Aly 2008) and the recently released TuSimple Benchmark Dataset (TuSimple 2017) consist of 1224 and 6408 images with annotated lane markings respectively, but the traffic is in a constrained scenario with light traffic and clear lane markings. Besides, none of these datasets annotates lane markings that are occluded or unseen because of abrasion, although such lane markings can be inferred by humans and are of high value in real applications.

To collect data, we mounted cameras on six different vehicles driven by different drivers and recorded videos during driving in Beijing on different days. More than 55 hours of video were collected and 133,235 frames were extracted, which is more than 20 times the size of the TuSimple dataset. We divided the dataset into 88,880 images for the training set, 9,675 for the validation set, and 34,680 for the test set. These images were undistorted using tools in (Scaramuzza, Martinelli, and Siegwart 2006) and have a resolution of 1640 × 590. Fig. 2 (a) shows some examples, which comprise urban, rural, and highway scenes. As one of the largest and most crowded cities in the world, Beijing provides many challenging traffic scenarios for lane detection. We divided the test set into a normal category and 8 challenging categories, which correspond to the 9 examples in Fig. 2 (a). Fig. 2 (b) shows the proportion of each scenario. It can be seen that the 8 challenging scenarios account for most (72.3%) of the dataset.

For each frame, we manually annotate the traffic lanes with cubic splines. As mentioned earlier, in many cases lane markings are occluded by vehicles or are unseen. In real applications it is important that lane detection algorithms can estimate lane positions from the context even in these challenging scenarios, which occur frequently. Therefore, for these cases we still annotate the lanes according to the context, as shown in Fig. 2 (a) (2)(4). We also hope that our algorithm can distinguish barriers on the road, like the one in Fig. 2 (a) (1). Thus the lanes on the other side of the barrier are not annotated. In this paper we focus our attention on the detection of four lane markings, which receive the most attention in real applications. Other lane markings are not annotated.

Spatial CNN

Traditional methods to model spatial relationships are based on Markov Random Fields (MRF) or Conditional Random


Figure 3: (a) MRF/CRF based method. (b) Our implementation of Spatial CNN. MRF/CRF are theoretically applied to unary potentials, whose channel number equals the number of classes to be classified, while SCNN can be applied to the top hidden layers with richer information.

Fields (CRF) (Krahenbuhl and Koltun 2011). Recent works (Zheng et al. 2015; Liu et al. 2015; Chen et al. 2017) that combine them with CNN all follow the pipeline of Fig. 3 (a), where the mean field algorithm can be implemented with neural networks. Specifically, the procedure is: (1) Normalize: the output of CNN is viewed as unary potentials and is normalized by the Softmax operation; (2) Message Passing, which can be realized by channel-wise convolution with large kernels (for dense CRF, the kernel size would cover the whole image and the kernel weights depend on the input image); (3) Compatibility Transform, which can be implemented with a 1 × 1 convolution layer; and (4) Adding unary potentials. This process is iterated N times to give the final output.
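For concreteness, the sketch below expresses this four-step mean-field loop in PyTorch; the kernel shapes, the `mean_field` name, and the iteration count are illustrative assumptions, not the implementation of any of the cited works.

```python
import torch
import torch.nn.functional as F

def mean_field(unary, msg_kernel, compat_kernel, n_iter=10):
    """Sketch of the MRF/CRF-as-CNN pipeline in Fig. 3 (a).
    unary:         (N, K, H, W) class scores from the CNN
    msg_kernel:    (K, 1, kh, kw) channel-wise message-passing filter
    compat_kernel: (K, K, 1, 1) compatibility transform
    """
    kh, kw = msg_kernel.shape[2:]
    q = unary
    for _ in range(n_iter):
        q = F.softmax(q, dim=1)                                  # (1) normalize
        q = F.conv2d(q, msg_kernel, padding=(kh // 2, kw // 2),  # (2) message passing
                     groups=unary.shape[1])                      #     channel-wise conv
        q = F.conv2d(q, compat_kernel)                           # (3) compatibility transform
        q = q + unary                                            # (4) add unary potentials
    return q
```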

It can be seen that in the message passing process of traditional methods, each pixel receives information from all other pixels, which is very computationally expensive and hard to use in real-time tasks such as autonomous driving. For MRF, the large convolution kernel is hard to learn and usually requires careful initialization (Tompson et al. 2014; Liu et al. 2015). Moreover, these methods are applied to the output of CNN, while the top hidden layer, which comprises richer information, might be a better place to model spatial relationships.

To address these issues, and to more efficiently learn the spatial relationship and the smooth, continuous prior of lane markings or other structured objects in the driving scenario, we propose Spatial CNN. Note that the 'spatial' here is not the same as that in 'spatial convolution', but denotes propagating spatial information via a specially designed CNN structure.

As shown in the 'SCNN_D' module of Fig. 3 (b), consider an SCNN applied on a 3-D tensor of size C × H × W, where C, H, and W denote the number of channels, rows, and columns respectively. The tensor is split into H slices, and the first slice is then sent into a convolution layer

with C kernels of size C × w, where w is the kernel width. In a traditional CNN the output of a convolution layer is fed into the next layer, while here the output is added to the next slice to provide a new slice. The new slice is then sent to the next convolution layer, and this process continues until the last slice is updated.

Specifically, assume we have a 3-D kernel tensor K with element K_{i,j,k} denoting the weight between an element in channel i of the last slice and an element in channel j of the current slice, with an offset of k columns between the two elements. Also denote the element of the input 3-D tensor X as X_{i,j,k}, where i, j, and k indicate the indexes of channel, row, and column respectively. Then the forward computation of SCNN is:

X'_{i,j,k} =
\begin{cases}
X_{i,j,k}, & j = 1 \\
X_{i,j,k} + f\left( \sum_{m} \sum_{n} X'_{m,\, j-1,\, k+n-1} \times K_{m,i,n} \right), & j = 2, 3, \ldots, H
\end{cases}
\tag{1}

where f is a nonlinear activation function such as ReLU. The X with superscript ′ denotes an element that has been updated. Note that the convolution kernel weights are shared across all slices, thus SCNN is a kind of recurrent neural network. Also note that SCNN has directions. In Fig. 3 (b), the four 'SCNN' modules with suffixes 'D', 'U', 'R', and 'L' denote SCNN that is downward, upward, rightward, and leftward respectively.
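The following is a minimal PyTorch sketch of the downward pass (SCNN_D) following Eq. (1); the channel count and kernel width are illustrative defaults rather than the released configuration, and the other three directions are analogous (iterate over the rows in reverse, or split along the width dimension instead).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCNNDown(nn.Module):
    """Downward SCNN pass (SCNN_D): each row slice is updated by adding a
    ReLU-activated message from the already-updated slice above it; the
    C x C x (1 x w) convolution is shared across all H slices."""

    def __init__(self, channels=128, kernel_w=9):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=(1, kernel_w),
                              padding=(0, kernel_w // 2), bias=False)

    def forward(self, x):                     # x: (N, C, H, W)
        slices = list(x.split(1, dim=2))      # H slices of shape (N, C, 1, W)
        for j in range(1, len(slices)):
            # residual message from the updated previous slice (sequential)
            slices[j] = slices[j] + F.relu(self.conv(slices[j - 1]))
        return torch.cat(slices, dim=2)
```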

Analysis

There are three main advantages of Spatial CNN over traditional methods, which are summarized as follows.

(1) Computational efficiency. As shown in Fig. 4, in dense MRF/CRF each pixel receives messages from all other pixels directly, which involves much redundancy, while in SCNN message passing is realized in a sequential propagation scheme. Specifically, assume a tensor with H rows and



Figure 4: Message passing directions in (a) dense MRF/CRF and (b) Spatial CNN (rightward). For (a), only message passing to the inner 4 pixels is shown for clarity.

W columns; then in dense MRF/CRF, there is message passing between every two of the WH pixels. For n_iter iterations, the number of message passing operations is n_iter * W^2 * H^2. In SCNN, each pixel only receives information from w pixels, thus the number of message passing operations is n_dir * W * H * w, where n_dir and w denote the number of propagation directions in SCNN and the kernel width of SCNN respectively. n_iter could range from 10 to 100, while in this paper n_dir is set to 4, corresponding to the 4 directions, and w is usually no larger than 10 (in the example in Fig. 4 (b), w = 3). It can be seen that for images with hundreds of rows and columns, SCNN saves a large amount of computation, while each pixel can still receive messages from all other pixels via message propagation along the 4 directions.
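As a quick worked example under the assumptions below (the 36 × 100 top-hidden-layer feature map used for SCNN in Table 6, 10 mean-field iterations, 4 directions, kernel width 9), the gap is roughly three orders of magnitude:

```python
# Message-count comparison from the analysis above (illustrative sizes).
H, W = 36, 100
n_iter, n_dir, w = 10, 4, 9

dense_crf_messages = n_iter * W**2 * H**2   # every pixel talks to every other pixel
scnn_messages = n_dir * H * W * w           # sequential slice-to-slice propagation

print(dense_crf_messages)                   # 129600000
print(scnn_messages)                        # 129600
print(dense_crf_messages // scnn_messages)  # 1000 -> about three orders of magnitude
```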

(2) Message as residual. In MRF/CRF, message passing is achieved via a weighted sum over all pixels, which, as discussed in the previous paragraph, is computationally expensive. Moreover, recurrent neural network based methods might suffer from vanishing or exploding gradients (Pascanu, Mikolov, and Bengio 2013), considering the large number of rows or columns. However, deep residual learning (He et al. 2016) has shown its capability to ease the training of very deep neural networks. Similarly, in our deep SCNN, messages are propagated as residuals, which are the outputs of the ReLU in Eq. (1). Such residuals can also be viewed as a kind of modification to the original neurons. As our experiments will show, this message passing scheme achieves better results than LSTM based methods.

(3) Flexibility. Thanks to the computational efficiency of SCNN, it can easily be incorporated into any part of a CNN, rather than just the output. Usually, the top hidden layer contains information that is both rich and highly semantic, and is thus an ideal place to apply SCNN. Fig. 3 (b) shows our implementation of SCNN on the LargeFOV (Chen et al. 2017) model: SCNNs for the four spatial directions are added sequentially right after the top hidden layer (the 'fc7' layer) to introduce spatial message propagation.

Experiment

We evaluate SCNN on our lane detection dataset and Cityscapes (Cordts et al. 2016). In both tasks, we train the models using standard SGD with batch size 12, base learning rate 0.01, momentum 0.9, and weight decay 0.0001. The learning rate policy is 'poly' with power and iteration number set to 0.9 and 60K respectively. Our models are modified based on the LargeFOV model in (Chen et al. 2017).
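For reference, the 'poly' policy with these settings can be sketched as follows (a plain restatement of the schedule above, not code from the released implementation):

```python
def poly_lr(iteration, base_lr=0.01, power=0.9, max_iter=60000):
    """'poly' learning-rate policy: decays from base_lr to 0 over max_iter
    iterations, using the values given in the training setup above."""
    return base_lr * (1 - iteration / max_iter) ** power

# e.g. poly_lr(0) == 0.01, poly_lr(30000) ~= 0.0054, poly_lr(60000) == 0.0
```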


Figure 5: (a) Training model, (b) Lane prediction process. 'Conv', 'HConv', and 'FC' denote convolution layer, atrous convolution layer (Chen et al. 2017), and fully connected layer respectively. 'c', 'w', and 'h' denote the number of output channels, kernel width, and 'rate' for atrous convolution.

The initial weights of the first 13 convolution layers are copied from VGG16 (Simonyan and Zisserman 2015) trained on ImageNet (Deng et al. 2009). All experiments are implemented in the Torch7 (Collobert, Kavukcuoglu, and Farabet 2011) framework.

Lane Detection

Lane detection model. Unlike common object detection tasks that only require bounding boxes, lane detection requires precise prediction of curves. A natural idea is that the model should output probability maps (probmaps) of these curves, thus we generate pixel level targets to train the networks, as in semantic segmentation tasks. Instead of viewing different lane markings as one class and doing clustering afterwards, we want the neural network to distinguish different lane markings by itself, which could be more robust. Thus the four lanes are viewed as different classes. Moreover, the probmaps are sent to a small network to give a prediction on the existence of each lane marking.
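One possible reading of the two output branches in Fig. 5 (a) is sketched below; the layer sizes follow the labels in the figure, but the exact wiring (for example, what the existence branch pools over) is an assumption rather than the released model definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaneHead(nn.Module):
    """Sketch of the two output branches: a probmap branch over
    background + 4 lanes, and a small lane-existence branch."""

    def __init__(self, in_channels=128, n_lanes=4):
        super().__init__()
        # probmap branch: 1x1 conv to (n_lanes + 1) channels, upsampled 8x
        self.score = nn.Conv2d(in_channels, n_lanes + 1, kernel_size=1)
        # existence branch: pooled probmaps -> two FC layers -> sigmoid
        self.exist = nn.Sequential(
            nn.AvgPool2d(2),
            nn.Flatten(),
            nn.LazyLinear(128),          # input size depends on the feature map
            nn.ReLU(inplace=True),
            nn.Linear(128, n_lanes),
            nn.Sigmoid(),
        )

    def forward(self, feat):             # feat: top hidden layer, (N, C, H, W)
        probmap = torch.softmax(self.score(feat), dim=1)
        probmap_up = F.interpolate(probmap, scale_factor=8,
                                   mode='bilinear', align_corners=False)
        existence = self.exist(probmap[:, 1:])   # background channel dropped
        return probmap_up, existence
```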

During testing, we still need to go from probmaps to curves. As shown in Fig. 5 (b), for each lane marking whose existence value is larger than 0.5, we search the corresponding probmap every 20 rows for the position with the highest response. These positions are then connected by cubic splines, which are the final predictions.
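A minimal sketch of this probmap-to-curve step is given below; `row_step` and the existence threshold come from the text, while `prob_thr` is an illustrative cutoff for rows where the lane is not visible.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def probmap_to_curve(probmap, existence, row_step=20,
                     exist_thr=0.5, prob_thr=0.3):
    """Turn one lane's probability map (H x W) into a curve, or None if the
    lane is predicted absent or too few confident points are found."""
    if existence <= exist_thr:
        return None                                  # lane predicted absent
    rows, cols = [], []
    for r in range(0, probmap.shape[0], row_step):
        c = int(probmap[r].argmax())
        if probmap[r, c] > prob_thr:                 # keep only confident rows
            rows.append(r)
            cols.append(c)
    if len(rows) < 4:
        return None                                  # too few points for a spline
    return CubicSpline(rows, cols)                   # maps row -> column
```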

As shown in Fig. 5 (a), the detailed differences between our baseline model and LargeFOV are: (1) the output channel number of the 'fc7' layer is set to 128, (2) the 'rate' for the atrous convolution layer of 'fc6' is set to 4, (3) batch normalization (Ioffe and Szegedy 2015) is added before each ReLU layer, and (4) a small network is added to predict the existence of lane markings. During training, the line width of the targets is set to 16 pixels, and the input and target images are rescaled to 800 × 288. Considering the label imbalance


Figure 6: Evaluation based on IoU. Green lines denote ground truth, while blue and red lines denote TP and FP respectively.

between background and lane markings, the loss for the background class is multiplied by 0.4.
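In PyTorch terms, this weighting might look like the following sketch (five classes: background plus four lanes; the choice of BCE for the existence branch is an assumption based on the sigmoid output in Fig. 5):

```python
import torch
import torch.nn as nn

# Background loss down-weighted by 0.4, as described above.
class_weights = torch.tensor([0.4, 1.0, 1.0, 1.0, 1.0])
seg_criterion = nn.CrossEntropyLoss(weight=class_weights)

# Existence branch: 4 independent binary targets after a sigmoid.
exist_criterion = nn.BCELoss()
```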

Evaluation. In order to judge whether a lane marking is successfully detected, we view lane markings as lines with widths equal to 30 pixels and calculate the intersection-over-union (IoU) between the ground truth and the prediction. Predictions whose IoUs are larger than a certain threshold are viewed as true positives (TP), as shown in Fig. 6. Here we consider the 0.3 and 0.5 thresholds, corresponding to loose and strict evaluations. We then employ the F-measure as the final evaluation index:

F = \frac{(1+\beta^{2})\,\mathrm{Precision}\cdot\mathrm{Recall}}{\beta^{2}\,\mathrm{Precision}+\mathrm{Recall}},\qquad
\mathrm{Precision} = \frac{TP}{TP+FP},\qquad
\mathrm{Recall} = \frac{TP}{TP+FN},

where β is set to 1, corresponding to the harmonic mean (F1-measure).
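The same index, written as a small helper for clarity:

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from TP/FP/FN counts; beta = 1 gives the F1 (harmonic mean)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
```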

Ablation Study. In the Spatial CNN section we proposed Spatial CNN to enable spatial message propagation. To verify our method, we conduct detailed ablation studies in this subsection. Our implementation of SCNN follows that shown in Fig. 3.

(1) Effectiveness of multidirectional SCNN. Firstly, we investigate the effect of directions in SCNN. We try SCNN with different direction implementations; the results are shown in Table 1. Here the kernel width w of SCNN is set to 5. It can be seen that the performance increases as more directions are added. To prove that the improvement does not result from more parameters but from the message passing scheme brought about by SCNN, we add an extra convolution layer with a 5 × 5 kernel after the top hidden layer of the baseline model and compare it with our method. From the results we can see that the extra convolution layer brings only a marginal improvement, which verifies the effectiveness of SCNN.

Table 1: Experimental results of SCNN with different directional settings. F1 denotes F1-measure, and the value in the bracket denotes the IoU threshold. The suffixes 'D', 'U', 'R', 'L' denote downward, upward, rightward, and leftward respectively.

Models     Baseline  ExtraConv  SCNN_D  SCNN_DU  SCNN_DURL
F1 (0.3)   77.7      77.6       79.5    79.9     80.2
F1 (0.5)   63.2      64.0       68.6    69.4     70.4

(2) Effects of kernel width w. We further try SCNN with different kernel widths based on the SCNN_DURL model, as shown in Table 2. Here the kernel width denotes the number of pixels that a pixel can receive messages from, and the w = 1 case is similar to the methods in (Visin et al. 2015; Bell et al. 2016). The results show that a larger w is beneficial, and w = 9 gives a satisfactory result, which surpasses the baseline by significant margins of 8.4% and 3.2% for the two IoU thresholds.

Table 2: Experimental results of SCNN with different kernel widths.

Kernel width w   1     3     5     7     9     11
F1 (0.3)         78.5  79.5  80.2  80.5  80.9  80.6
F1 (0.5)         66.3  68.9  70.4  71.2  71.6  71.7

(3) Spatial CNN at different positions. As mentioned earlier, SCNN can be added at any place in a neural network. Here we consider the SCNN_DURL model applied on (1) the output and (2) the top hidden layer, which correspond to Fig. 3. The results in Table 3 indicate that the top hidden layer, which comprises richer information than the output, turns out to be a better position to apply SCNN.

Table 3: Experimental results of Spatial CNN at different positions, with w = 9.

Position   Output  Top hidden layer
F1 (0.3)   79.9    80.9
F1 (0.5)   68.8    71.6

(4) Effectiveness of sequential propagation. In our SCNN, information is propagated in a sequential way, i.e., a slice does not pass information to the next slice until it has received information from the former slices. To verify the effectiveness of this scheme, we compare it with parallel propagation, i.e., each slice passes information to the next slice simultaneously before being updated. For this parallel case, the ′ on the right-hand side of Eq. (1) is removed. As Table 4 shows, the sequential message passing scheme outperforms the parallel scheme significantly. This result indicates that in SCNN, a pixel is not merely affected by nearby pixels, but also receives information from positions further away.
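For comparison with the sequential sketch given earlier, the parallel variant can be written as follows (again a sketch; `conv` is the shared 1 × w convolution from the sequential version):

```python
import torch
import torch.nn.functional as F

def scnn_down_parallel(x, conv):
    """Parallel variant used for the comparison above: every slice receives
    its message from the *original* previous slice, i.e. the prime on the
    right-hand side of Eq. (1) is dropped."""
    slices = x.split(1, dim=2)                        # original, un-updated slices
    out = [slices[0]]
    for j in range(1, len(slices)):
        out.append(slices[j] + F.relu(conv(slices[j - 1])))
    return torch.cat(out, dim=2)
```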

Table 4: Comparison between sequential and parallel message passing schemes, for SCNN_DULR with w = 9.

Message passing scheme   Parallel  Sequential
F1 (0.3)                 78.4      80.9
F1 (0.5)                 65.2      71.6

(5) Comparison with state-of-the-art methods. To further verify the effectiveness of SCNN in lane detection, we compare it with several methods: the RNN based ReNet (Visin et al. 2015), the MRF based MRFNet, DenseCRF (Krahenbuhl and Koltun 2011), and very deep residual networks (He et al. 2016). For ReNet based on LSTM, we replace the 'SCNN' layers in Fig. 3 with two ReNet layers: one layer to pass horizontal information and the other to pass vertical information. For DenseCRF, we use dense CRF as post-processing and employ 10 mean field iterations as in (Chen et al. 2017). For MRFNet, we use


Table 5: Comparison with other methods, with IoU threshold = 0.5. For crossroad, only FP is shown.

Category      Baseline  ReNet  DenseCRF  MRFNet  ResNet-50  ResNet-101  Baseline+SCNN
Normal        83.1      83.3   81.3      86.3    87.4       90.2        90.6
Crowded       61.0      60.5   58.8      65.2    64.1       68.2        69.7
Night         56.9      56.3   54.2      61.3    60.6       65.9        66.1
No line       34.0      34.5   31.9      37.2    38.1       41.7        43.4
Shadow        54.7      55.0   56.3      59.3    60.7       64.6        66.9
Arrow         74.0      74.1   71.2      76.9    79.0       84.0        84.1
Dazzle light  49.9      48.2   46.2      53.7    54.1       59.8        58.5
Curve         61.0      59.9   57.8      62.3    59.8       65.5        64.4
Crossroad     2060      2296   2253      1837    2505       2183        1990
Total         63.2      62.9   61.0      67.0    66.7       70.8        71.6

Figure 7: Comparison between probmaps of baseline, ReNet, MRFNet, ResNet-101, and SCNN.

the implementation in Fig. 3 (a), with the number of iterations and the message passing kernel size set to 10 and 20 respectively. The main difference of the MRF here from CRF is that the weights of the message passing kernels are learned during training rather than depending on the image. For ResNet, our implementation is the same as (Chen et al. 2017) except that we do not use the ASPP module. For SCNN, we add the SCNN_DULR module to the baseline, and the kernel width w is 9. The test results on different scenarios are shown in Table 5, and visualizations are given in Fig. 7.

From the results, we can see that the performance of ReNet is not even comparable with SCNN_DULR with w = 1, indicating the effectiveness of our residual message passing scheme. Interestingly, DenseCRF leads to worse results here, because lane markings usually have few appearance clues, so dense CRF cannot distinguish lane markings from background. In contrast, with kernel weights learned from data, MRFNet can to some extent smooth the results and improve performance, as Fig. 7 shows, but it is still not very satisfactory. Furthermore, our method even outperforms the much deeper ResNet-50 and ResNet-101. Despite its over one hundred layers and very large receptive field, ResNet-101 still gives messy or discontinuous outputs in challenging cases, while our method, with only 16 convolution layers plus 4 SCNN layers, preserves the smoothness and continuity of lane lines better. This demonstrates the much stronger capability of SCNN to capture the structure prior of objects compared with traditional CNN.

(6) Computational efficiency over other methods. In the Analysis section we gave a theoretical analysis of the computational efficiency of SCNN over dense CRF. To verify

this, we compare their runtimes experimentally. The results are shown in Table 6, where the runtime of the LSTM in ReNet is also given. Here the runtime does not include that of the backbone network. For SCNN, we test both the practical case and the case with the same setting as dense CRF. In the practical case, SCNN is applied on the top hidden layer, thus the input has more channels but smaller height and width. In the fair comparison case, the input size is modified to be the same as that in dense CRF, and both methods are tested on CPU. The results show that even in the fair comparison case, SCNN is over 4 times faster than dense CRF, despite the efficient implementation of dense CRF in (Krahenbuhl and Koltun 2011). This is because SCNN significantly reduces redundancy in message passing, as shown in Fig. 4. Also, SCNN is more efficient than LSTM, whose gate mechanism requires more computation.

Table 6: Runtime of dense CRF, LSTM, and SCNN. The two SCNNs correspond to the one used in practice and the one whose input size is modified for fair comparison with dense CRF respectively. The kernel width w of SCNN is 9.

Method                         Input size (C × H × W)   Device   Runtime (ms)
dense CRF                      5 × 288 × 800            CPU*     737
LSTM                           128 × 36 × 100           GPU**    115
SCNN_DULR (in practice)        128 × 36 × 100           GPU      42
SCNN_DULR (fair comparison)    5 × 288 × 800            CPU      176

* Intel Core i7-4790K CPU   ** GeForce GTX TITAN Black


Table 7: Results on Cityscapes validation set.

Class          LargeFOV  LargeFOV+SCNN  ResNet-101  ResNet-101+SCNN
road           97.0      97.0           98.3        98.3
terrain        59.2      59.8           64.2        65.4
building       89.9      90.3           92.4        92.6
wall           42.2      45.7           44.5        46.7
car            92.3      92.5           94.9        94.8
pole           52.9      55.2           66.0        66.1
traffic light  62.3      62.3           74.5        74.3
traffic sign   71.1      71.7           82.1        81.5
fence          52.2      52.5           59.9        61.2
sidewalk       78.8      78.1           86.0        86.1
sky            92.2      92.6           94.7        94.7
rider          52.1      53.2           65.5        65.5
person         75.9      76.4           84.1        84.0
vegetation     91.0      91.1           92.7        92.7
truck          48.8      55.6           57.3        57.7
bus            70.2      71.2           81.1        82.0
train          37.6      41.7           54.0        59.9
motor          54.6      56.2           64.5        67.0
bicycle        72.3      72.3           80.0        80.1
mIoU           68.0      69.2           75.6        76.4

Figure 8: Visual improvements on Cityscapes validation set. For each example, from left to right are: input image, ground truth, result of LargeFOV, result of LargeFOV+SCNN.

Semantic Segmentation on Cityscapes

To demonstrate the generality of our method, we also evaluate Spatial CNN on Cityscapes (Cordts et al. 2016). Cityscapes is a standard benchmark dataset for semantic segmentation on urban traffic scenes. It contains 5000 finely annotated images, including 2975 for training, 500 for validation, and 1525 for testing. 19 categories are defined, including both stuff and objects. We use two classic models, LargeFOV and ResNet-101 from DeepLab (Chen et al. 2017), as the baselines. Batch normalization layers (Ioffe and Szegedy 2015) are added to LargeFOV to enable faster convergence. For both models, the channel numbers of the top hidden layers are modified to 128 to make them more compact.

We add SCNN to the baseline models in the same way as in lane detection. The comparisons between the baselines and those combined with the SCNN_DURL models with kernel width w = 9 are shown in Table 7. It can be seen that SCNN also improves semantic segmentation results. With SCNNs added, the IoUs for all classes are at least comparable to the baselines, while the 'wall', 'pole', 'truck', 'bus', 'train', and 'motor' categories achieve significant improvements. This is because for long-shaped objects like trains and poles, SCNN can capture their continuous structure and connect the disconnected parts, as shown in Fig. 8. And for wall, truck, and bus, which can occupy large image areas, the diffusion effect of SCNN can correct parts that are misclassified according to the context. This shows that SCNN is useful not only for long thin structures, but also for large objects which require global information to be classified correctly. There is another interesting phenomenon: the head of the vehicle at the bottom of the images, whose label is ignored during training, is a mess in LargeFOV, while with SCNN added it is classified as road. This is also due to the diffusion effect of SCNN, which passes the information of the road to the vehicle head area.

To compare our method with other MRF/CRF based

Table 8: Comparison between our SCNN and other MRF/CRF based methods on Cityscapes test set.

Method   LargeFOV (Chen et al. 2017)   DPN (Liu et al. 2015)   Ours
mIoU     63.1                          66.8                    68.2

methods, we evaluate LargeFOV+SCNN on the Cityscapes test set, and compare with methods that also use VGG16 (Simonyan and Zisserman 2015) as the backbone network. The results are shown in Table 8. Here LargeFOV, DPN, and our method use dense CRF, dense MRF, and SCNN respectively, and share nearly the same base CNN part. The results show that our method achieves significantly better performance.

Conclusion

In this paper, we propose Spatial CNN, a CNN-like scheme to achieve effective information propagation at the spatial level. SCNN can be easily incorporated into deep neural networks and trained end-to-end. It is evaluated on two tasks in traffic scene understanding: lane detection and semantic segmentation. The results show that SCNN can effectively preserve the continuity of long thin structures, while in semantic segmentation its diffusion effect is also proven to be beneficial for large objects. Specifically, by introducing SCNN into the LargeFOV model, our 20-layer network outperforms ReNet, MRF, and the very deep ResNet-101 in lane detection. Last but not least, we believe that the large challenging lane detection dataset we present will push forward research on autonomous driving.

Acknowledgments

This work is supported by SenseTime Group Limited. We would like to thank Xiaohang Zhan, Jun Li, and Xudong Cao for their help in building the lane detection dataset.


References

Aly, M. 2008. Real time detection of lane markers in urban streets. In Intelligent Vehicles Symposium, 2008 IEEE, 7–12. IEEE.
Bar Hillel, A.; Lerner, R.; Levi, D.; and Raz, G. 2014. Recent progress in road and lane detection: a survey. Machine Vision and Applications 1–19.
Bell, S.; Lawrence Zitnick, C.; Bala, K.; and Girshick, R. 2016. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR.
Brostow, G. J.; Shotton, J.; Fauqueur, J.; and Cipolla, R. 2008. Segmentation and recognition using structure from motion point clouds. In ECCV.
Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; and Yuille, A. L. 2017. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI.
Chu, X.; Ouyang, W.; Wang, X.; et al. 2016. CRF-CNN: Modeling structured information in human pose estimation. In NIPS.
Collobert, R.; Kavukcuoglu, K.; and Farabet, C. 2011. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop.
Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; and Schiele, B. 2016. The Cityscapes dataset for semantic urban scene understanding. In CVPR.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.
Fritsch, J.; Kuhnl, T.; and Geiger, A. 2013. A new performance measure and evaluation benchmark for road detection algorithms. In Intelligent Transportation Systems (ITSC), 2013 16th International IEEE Conference on, 1693–1700. IEEE.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
Huval, B.; Wang, T.; Tandon, S.; Kiske, J.; Song, W.; Pazhayampallil, J.; Andriluka, M.; Rajpurkar, P.; Migimatsu, T.; Cheng-Yue, R.; et al. 2015. An empirical evaluation of deep learning on highway driving. arXiv preprint arXiv:1504.01716.
Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML.
Jung, S.; Youn, J.; and Sull, S. 2016. Efficient lane detection based on spatiotemporal images. IEEE Transactions on Intelligent Transportation Systems 17(1):289–295.
Krahenbuhl, P., and Koltun, V. 2011. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.
Liang, X.; Shen, X.; Feng, J.; Lin, L.; and Yan, S. 2016a. Semantic object parsing with graph LSTM. In ECCV.
Liang, X.; Shen, X.; Xiang, D.; Feng, J.; Lin, L.; and Yan, S. 2016b. Semantic object parsing with local-global long short-term memory. In CVPR.
Liu, Z.; Li, X.; Luo, P.; Loy, C.-C.; and Tang, X. 2015. Semantic image segmentation via deep parsing network. In ICCV.
Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In CVPR.
Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. In ICML.
Scaramuzza, D.; Martinelli, A.; and Siegwart, R. 2006. A flexible technique for accurate omnidirectional camera calibration and structure from motion. In Computer Vision Systems, 2006 ICVS '06. IEEE International Conference on, 45–45. IEEE.
Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR.
Son, J.; Yoo, H.; Kim, S.; and Sohn, K. 2015. Real-time illumination invariant lane detection for lane departure warning system. Expert Systems with Applications 42(4):1816–1824.
Tompson, J. J.; Jain, A.; LeCun, Y.; and Bregler, C. 2014. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS.
TuSimple. 2017. TuSimple benchmark. http://benchmark.tusimple.ai/#/.
Urmson, C.; Anhalt, J.; Bagnell, D.; Baker, C.; Bittner, R.; Clark, M.; Dolan, J.; Duggins, D.; Galatali, T.; Geyer, C.; et al. 2008. Autonomous driving in urban environments: Boss and the urban challenge. Journal of Field Robotics 25(8):425–466.
Visin, F.; Kastner, K.; Cho, K.; Matteucci, M.; Courville, A.; and Bengio, Y. 2015. ReNet: A recurrent neural network based alternative to convolutional networks. arXiv preprint arXiv:1505.00393.
Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.; Du, D.; Huang, C.; and Torr, P. H. 2015. Conditional random fields as recurrent neural networks. In ICCV.