
arXiv:1910.00895v1 [cs.CV] 2 Oct 2019

Object Parsing in Sequences Using CoordConv Gated Recurrent Networks

Ayush Gaud1, Y V S Harish1 and K Madhava Krishna1

Fig. 1: Overview of our approach. We propose a recurrent hourglass network based on CoordConvGRU cells. We demonstrate that spatio-temporal consistency is preserved and that we generate increasingly refined estimates over sequential data. All the network modules share the same weights across stages, while the hidden states from the recurrent cells are passed as input for the next iteration of keypoint localization on subsequent frames.

Abstract— We present a monocular object parsing framework for consistent keypoint localization by capturing temporal correlation in sequential data. In this paper, we propose a novel recurrent network based architecture to model long-range dependencies between intermediate features, which are highly useful in tasks like keypoint localization and tracking. We leverage the expressiveness of the popular stacked hourglass architecture and augment it by adding memory units between intermediate layers of the network, with weights shared across stages for video frames. We observe that this weight sharing scheme not only enables us to frame the hourglass architecture as a recurrent network but also proves highly effective in producing increasingly refined estimates for sequential tasks. Furthermore, we propose a new memory cell, which we call CoordConvGRU, that learns to selectively preserve spatio-temporal correlation, and we showcase our results on the keypoint localization task. The experiments show that our approach is able to model the motion dynamics between frames and significantly outperforms the baseline hourglass network. Even though our network is trained on a synthetically rendered dataset, we observe that with minimal fine-tuning on 300 real images we achieve performance on par with various state-of-the-art methods trained with the same level of supervisory inputs. Using a simpler architecture than other methods enables us to run our network in real time on a standard GPU, which is desirable for such applications. Finally, we make our architectures and 524 annotated sequences of cars from the KITTI dataset publicly available.

1 Ayush Gaud, Y V S Harish and K Madhava Krishna are with Robotics Research Center, International Institute of Information Technology, Hyderabad, India. [email protected], [email protected], [email protected]

I. INTRODUCTION

Estimating semantically meaningful joint locations, and hence the shape and pose of moving objects like cars in a highway driving scenario, is a challenging problem in computer vision. With the growing interest in autonomous driving, many successful approaches use expensive sensors like LiDARs and stereo cameras for tracking and predicting the trajectories of vehicles on the road. However, there is an ongoing effort to minimize the cost of the sensor suite required for such tasks. Some recent approaches have shown that both shape and pose can be recovered from monocular images, but they fail to leverage the temporal characteristics of the data in such scenarios [1]. Our approach builds on the success of the stacked hourglass architecture [2], which showcased state-of-the-art results for human pose prediction. While its multi-level encoder-decoder architecture with skip layers preserves spatial information at different scales, it fails to exploit the temporal information present in sequential data and treats each frame independently, which is detrimental especially in scenarios like driving.

In this paper, we posit that leveraging temporal information using recurrent architectures enables us to parse objects with higher accuracy. This is based on the notion that the corresponding part locations (keypoints) of an object usually stay in the same neighborhood across frames of a sequence, especially for a rigid body. We present findings which conform to this hypothesis and justify the validity of our proposed network architecture for this task.


Specifically, we demonstrate that in driving scenarios, where the environment is highly dynamic, this property can be quite useful for tasks like propagating annotations of on-road vehicles consistently across frames, unlike other methods [1][2][3][4], which discard the sequential consistency of a scene by treating each frame independently when making predictions.

To illustrate this idea, we utilize semantically meaningful locations of an object, represented as keypoints, which efficiently encapsulate the 3D structure of an object category. These 3D keypoints are annotated on CAD models, which enables us to generate the large amount of synthetic data required for training an architecture like this. The dataset is generated using annotated CAD models of cars from the ShapeNet [5] corpus. The cars are rendered from different viewpoints by moving the camera in a natural motion to generate sequences that appear temporally consistent. We observe a consistent improvement in the accuracy of predicted keypoint locations as we increase the expressiveness of the model, modifying the original hourglass architecture by adding ConvGRU and then CoordConvGRU cells to the residual layers. Our contributions are summarized as follows:

1) We introduce an improved stacked hourglass architecture that imposes spatio-temporal consistency on consecutive video frames from a monocular camera for keypoint prediction. Instead of employing RNNs based on vanilla LSTMs, we utilize convolutional gated recurrent networks [6] to preserve spatial connectivity in the image.

2) We note further improvement using our proposed CoordConvGRU cells, designed by adding two coordinate channels, one for each axis, to the convolution layer of the ConvGRU, which learns to selectively preserve translational invariance and to effectively capture the temporal correlation between intermediate features of the frames in a sequence.

We evaluate our hypothesis by conducting ablation experiments on different network designs to highlight the efficacy of our approach. To further emphasize its generalizability, we also perform benchmarks on images from the KITTI dataset and compare the predicted keypoint locations with those of various other networks. Our architecture outperforms several other architectures while at the same time being fast enough to run in real time on a standard GPU. In Section IV we present the qualitative and quantitative results of our proposed architecture and demonstrate a consistent improvement in performance over other architectures.

II. RELATED WORK

Keypoint localization is a well studied topic in the literature, popularized by pose estimation for humans. DeepPose [7] shifted attention from classical approaches to deep learning based approaches. Tompson et al. [8] further brought a significant improvement in keypoint localization by introducing a fully convolutional network.

They proposed that generating heatmaps directly from a hierarchy of multi-scale convolutional structures, combined with a graphical model, enables learning the spatial relationships between the joints and achieves higher accuracy. Tompson built this work [8] on the idea of cascade refinement from multi-stage pose machines [9].

Newell et al. [2] introduced intermediate supervision to a cascaded network based on conv-deconv and encoder-decoder architectures [10][11][12][13]. The repeated bottom-up and top-down inference from stacking further allows for the reevaluation of initial estimates and features across the image. This helps in forming higher order spatial relationships and maintaining local and global cues.

In a few studies, temporal cues have also been integrated [14][15][16][17][18][19] into the network for the task of pose estimation. Using dense optical flow [20], certain approaches [18][19] have attempted to predict consistent joint positions that preserve smooth motion across frames. The Thin-Slicing network [19] presented improved results by using both optical flow and a spatial-temporal model, but at the cost of computational complexity, rendering it slower than other approaches.

LSTM pose machines [21] introduced the use of memory-augmented recurrent networks to capture temporal consistency and were able to outperform various state-of-the-art approaches in 2D video pose estimation tasks. Their design was based on another strong baseline, Convolutional Pose Machines [22], using essentially the same architecture but with a weight sharing scheme across stages, while at the same time using LSTM cells to propagate intermediate features. This helps in better utilizing historical joint locations and achieves almost double the inference speed, which is critical for real-time applications.

Siam et al. [23] investigated a convolutional gated recurrent architecture (ConvGRU) for video segmentation. They proposed a modification to standard recurrent units, which are designed for processing text data, to preserve the spatial connectivity between pixels that would otherwise be lost. By replacing the dot products with convolution operations, they were able to create a much more efficient recurrent cell that can easily be trained on images without the curse of high-dimensional weights. They achieved significant gains over baseline architectures by exploiting the spatio-temporal information in the videos.

Payer et al. [24], further utilizing the effectiveness of both the hourglass and ConvGRU architectures, presented a recurrent hourglass architecture for instance segmentation and tracking. They rigorously evaluated their approach on various cell tracking datasets and achieved state-of-the-art performance.

CoordConv layers, presented by Liu et al. [25], pose a counterintuitive challenge to the assumption that convolution layers are appropriate for learning spatial representations. This is justified by a trivial counterexample: learning coordinate transforms in one-hot pixel space. They claim that adding a few extra coordinate channels enables the network to learn varying degrees of translational invariance. A stark contrast is observed in the rate of learning and the parameters required across varying application domains, including image classification and object detection.

In this paper, unlike other approaches, we treat the problem of keypoint localization as a semantic segmentation and tracking task, and we propose a novel architecture that can leverage the spatio-temporal features in sequences to generate more consistent and robust detections. Using our proposed CoordConvGRU architecture, we demonstrate consistent performance improvement by conducting ablation studies on synthetic video sequences, and we compare our performance with other state-of-the-art methods in Section IV. We also make the network architectures and annotated data publicly available.1

III. APPROACH

A. Overview

Our approach builds on the notion that real-world data possesses temporal consistency and that spatial features move gradually with time. Hence, instead of processing static images, we propose a network that learns to model the motion dynamics in an end-to-end fashion. We leverage a recurrent neural network architecture that jointly learns to model spatial and temporal features from videos.

B. Network Architecture

Fig. 3: Two Stack Hourglass with Recurrent Cells for Skip Layers

1) Stacked Hourglass with Intermediate Supervision: The stacked hourglass network proposed by Newell et al. [2] shows that stacking conserves higher order spatial relationships. The hourglass stages enable inferring higher order features when intermediate supervision is applied. We utilize a similar architecture with iterative refinement, using multiple hourglass stacks to predict keypoint likelihood maps.

1Project page and annotated KITTI sequences: https://ayushgaud.github.io/Stacked_HG_CoordConvGRU/

Fig. 4: Comparison of different types of GRU cells

2) Convolutional Gated Recurrent Units (ConvGRU): Conventional recurrent neural networks are applied to sequences of inputs to capture the temporal relations in the data. However, due to the vanishing gradient problem, gated architectures were proposed. Long Short-Term Memory (LSTM), one of the most popular RNN models, has three gates, namely input, output and forget. The latter controls the amount of information flowing from previous states and acts as a memory for predictions. The Gated Recurrent Unit (GRU) behaves like an LSTM but with a more efficient architecture: by assuming a correlation between memorizing and forgetting, it uses only a single gate to control both, and consequently the output flow. However, to use RNNs on images, the images are vectorized into large 1D arrays, which increases the number of parameters to learn and loses the spatial connectivity between pixels. This makes the network harder to train due to the large search space in which spatial context is already lost. To mitigate this problem, convolutional recurrent units were introduced which, unlike a regular GRU, replace dot products with small convolution filters, as shown in the equations below:

z_t = \sigma(W_{hz} \ast h_{t-1} + W_{xz} \ast x_t + b_z) \quad (1a)

r_t = \sigma(W_{hr} \ast h_{t-1} + W_{xr} \ast x_t + b_r) \quad (1b)

\tilde{h}_t = \Phi(W_h \ast (r_t \odot h_{t-1}) + W_x \ast x_t + b) \quad (1c)

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad (1d)

where \ast denotes convolution and \odot denotes the element-wise (Hadamard) product.

Learning such small filters instead of weights for each pixel proves to be much more efficient while at the same time preserving spatial context.
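To make the gating concrete, the following is a minimal sketch of a ConvGRU cell implementing Eqs. (1a)-(1d), written with tf.keras for readability. The filter count, kernel size, the choice of tanh for Φ, and all layer names are illustrative assumptions, not the authors' released implementation.

```python
import tensorflow as tf

class ConvGRUCell(tf.keras.layers.Layer):
    def __init__(self, filters, kernel_size=3):
        super().__init__()
        def conv(use_bias):
            return tf.keras.layers.Conv2D(
                filters, kernel_size, padding="same", use_bias=use_bias)
        # One convolution per weight tensor in Eqs. (1a)-(1c); each equation's
        # bias term lives in the input-side convolution.
        self.w_hz, self.w_xz = conv(False), conv(True)  # update gate, Eq. (1a)
        self.w_hr, self.w_xr = conv(False), conv(True)  # reset gate, Eq. (1b)
        self.w_h,  self.w_x  = conv(False), conv(True)  # candidate state, Eq. (1c)

    def call(self, x_t, h_prev):
        z_t = tf.sigmoid(self.w_hz(h_prev) + self.w_xz(x_t))      # Eq. (1a)
        r_t = tf.sigmoid(self.w_hr(h_prev) + self.w_xr(x_t))      # Eq. (1b)
        h_cand = tf.tanh(self.w_h(r_t * h_prev) + self.w_x(x_t))  # Eq. (1c), Φ = tanh
        return (1.0 - z_t) * h_prev + z_t * h_cand                # Eq. (1d)
```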

3) CoordConv Layer: CoordConv layers [25] are a simple extension of regular convolution layers. They add two extra channels (in the case of images) filled with coordinate information, which are concatenated to the input of a convolution layer. This enables the network to flexibly learn translational invariance: zero weights for these channels imply full translational invariance, like a standard convolution, while any other value leads the network to learn translational invariance up to a degree, which becomes a trainable property dependent on the task.
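As a sketch of this idea under our reading of [25], the helper below appends two coordinate channels, normalized to [-1, 1], to a feature map; the helper name and the normalization are our assumptions. A CoordConv layer is then simply this helper followed by a standard convolution, so zero weights on the two extra channels recover an ordinary convolution.

```python
import tensorflow as tf

def add_coord_channels(x):
    """Append normalized row (i) and column (j) channels to x of shape (B, H, W, C)."""
    b, h, w = tf.shape(x)[0], tf.shape(x)[1], tf.shape(x)[2]
    i = tf.cast(tf.range(h), tf.float32) / tf.cast(h - 1, tf.float32) * 2.0 - 1.0
    j = tf.cast(tf.range(w), tf.float32) / tf.cast(w - 1, tf.float32) * 2.0 - 1.0
    ii, jj = tf.meshgrid(i, j, indexing="ij")           # (H, W) coordinate grids
    coords = tf.stack([ii, jj], axis=-1)                # (H, W, 2)
    coords = tf.tile(coords[tf.newaxis], [b, 1, 1, 1])  # broadcast over the batch
    return tf.concat([x, coords], axis=-1)              # (B, H, W, C + 2)
```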

4) Proposed Network Architecture: We introduce an improved architecture based on the stacked hourglass design that preserves both spatial and temporal context in a sequence of images. Capitalizing on the above notions, we present a recurrent architecture by replacing the standard convolutions in the skip connections with ConvGRU layers.


Fig. 5: Synthesis Pipeline: The above figure depicts the synthetic dataset generation flow. ShapeNet [5] CAD models are rendered with Blender from various viewpoints in a smooth motion to generate sequences. Green blocks represent the outputs of the pipeline, which include the sequence images with background, the corresponding 2D keypoints and a list of visible keypoints.

The RNN cells are stacked together to process a sequence of four images, while the weights among the stacks for each frame are shared within the graph. Since the original paper [2] observed only a marginal performance improvement when comparing an eight-stack hourglass network with a two-stack network, we also adopt a two-stack architecture with intermediate supervision. Each stack consists of five layers of convolution filters with max pooling operations of pool size (2,2). The output of each layer is passed to the next convolution layer and to the skip layer consisting of a GRU cell. This is followed by a series of upsampling layers and subsequent convolutions that bring the output size to 64x64. Each convolution filter has a channel size of 36, corresponding to the keypoint heatmaps, except in the case of CoordConvGRU cells, where two channels are added for the i and j coordinates, as shown in Figure 4. We use four residual modules, each embedded with an RNN cell, for both stacks. We train the network on 64x64 images. The network outputs 36 heatmaps of size 64x64 at the end of every stack, along with the states of the recurrent cells. We further show that our proposed cell, created by modifying the ConvGRU layers with CoordConv layers, i.e., adding two channels, one for the X and one for the Y axis, leads to a further improvement in performance. The implementation of this architecture is publicly available.2
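Combining the two sketches above gives one plausible reading of the CoordConvGRU cell in Figure 4. The subclass below is our hypothetical composition, appending the coordinate channels to the cell input before the gate convolutions; it should not be taken as the released implementation.

```python
# Hypothetical CoordConvGRU: the ConvGRUCell sketch above with the two
# coordinate channels from add_coord_channels appended to its input.
class CoordConvGRUCell(ConvGRUCell):
    def call(self, x_t, h_prev):
        return super().call(add_coord_channels(x_t), h_prev)
```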

C. Loss Function

2Tensorflow implementation: https://github.com/ayushgaud/hourglass_CoordConvGRU

Newell et al. [2], in the original stacked hourglass paper, treat the keypoint localization task as a regression problem and hence use a mean squared error loss for optimizing over the heatmaps. We take a different route by treating this task as a classification problem. This notion stems from the similarity between the semantic segmentation and keypoint localization tasks. Accordingly, we use sigmoid cross entropy as the loss function over the heatmaps. These heatmaps are computed using a Gaussian distribution centered at the ground truth location of each keypoint, with a standard deviation of 1 pixel. In our case we represent the vehicle model using 36 keypoints annotated at semantically meaningful joint locations. If the predicted heatmaps for the i-th image in a sequence of length N are represented by X_i and the labeled heatmaps by Z_i, the sigmoid cross entropy loss can be written as:

\text{loss} = \sum_{i=1}^{N} \left( X_i - X_i \odot Z_i + \log(1 + e^{-X_i}) \right) \quad (2)

We minimize this loss between the predicted heatmaps from both stacks and the ground truth heatmaps, leveraging the benefits of intermediate supervision along with the standard supervision applied at the end of the network.
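As a minimal sketch of this loss, assuming 36 keypoints on 64x64 heatmaps and the 1-pixel Gaussian targets described above (function names are illustrative):

```python
import tensorflow as tf

def gaussian_heatmap(center_yx, size=64, sigma=1.0):
    """Target heatmap: a Gaussian with a 1-pixel standard deviation
    centered at the ground-truth keypoint location."""
    grid = tf.cast(tf.range(size), tf.float32)
    yy, xx = tf.meshgrid(grid, grid, indexing="ij")
    d2 = (yy - center_yx[0]) ** 2 + (xx - center_yx[1]) ** 2
    return tf.exp(-d2 / (2.0 * sigma ** 2))

def sequence_loss(logits, targets):
    """Sigmoid cross entropy of Eq. (2), summed over frames, keypoints and
    pixels; TF computes the numerically stable form of
    x - x*z + log(1 + exp(-x))."""
    return tf.reduce_sum(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=targets, logits=logits))
```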

D. Dataset Generation and Training

1) Synthetic Dataset Generation: Since we require a large amount of data to train our network and annotated sequences are hard to find, we use the 3D keypoint annotations from [26] and render CAD models of cars from ShapeNet [5] using the Render for CNN [27] pipeline. To introduce sufficient variation into the data, we randomly sample initial viewpoints and their respective increments from a uniform distribution. These viewpoint increments help generate the desired smooth, realistic motions. We also vary the number of light sources and their intensities randomly for each sequence. An overview of the synthesis pipeline is given in Figure 5.


We use 472 3D CAD models and their corresponding annotations provided by [28]. The 36 semantically meaningful joints on each car are projected from the viewpoints at which the images are rendered. To identify the visible set of keypoints for evaluation, we also render depth data and record the camera viewpoint for each frame. We then use this information to perform ray tracing and determine the depth of each keypoint. This depth is compared with the depth of the projected keypoint in the rendered depth image, and mismatching keypoints are marked as occluded. Each synthetic image is cropped and overlaid on a realistic background image to avoid overfitting. To maintain consistency, the same background image is used across the frames of each sequence. A total of one hundred sequences of four frames each are generated for every CAD model.
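A simplified sketch of this visibility test, with the depth tolerance eps and the array names as our own assumptions:

```python
import numpy as np

def visible_keypoints(kp_uv, kp_depth, depth_map, eps=1e-2):
    """kp_uv: (K, 2) projected pixel coordinates; kp_depth: (K,) ray-traced
    keypoint depths; depth_map: rendered depth image."""
    vis = np.zeros(len(kp_uv), dtype=bool)
    for k, ((u, v), d) in enumerate(zip(kp_uv.astype(int), kp_depth)):
        # A keypoint is occluded when the surface the camera sees at its
        # projected pixel is closer than the keypoint itself.
        vis[k] = d <= depth_map[v, u] + eps
    return vis
```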

2) Training Details: The proposed network is built on the TensorFlow framework. We initialize the weights using the Xavier initialization scheme, with a learning rate of 2.5 × 10^-4. We use the RMSProp optimizer and decay the learning rate exponentially every 20000 steps by a factor of 0.96. The inputs to the network are images of size 64x64, in batches of 60 for the initial training. Once the loss plateaus, backpropagation is performed on the unrolled recurrent network with shared pre-trained weights, on a sequence length of 4 frames and a batch size of 16. The loss is computed on all frames in the sequence and used to optimize the network. We stop training at around 500K iterations in total, which takes almost 2 days on 4 Nvidia 1080Ti GPUs.
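The stated optimizer settings translate directly to TensorFlow; below is a sketch using current tf.keras APIs (the staircase flag is our assumption for "every 20000 steps"):

```python
import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=2.5e-4,  # initial learning rate from the text
    decay_steps=20000,             # decay every 20000 steps
    decay_rate=0.96,               # by a factor of 0.96
    staircase=True)                # stepwise decay (assumed)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=lr_schedule)
```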

IV. RESULTS

A. Keypoint Localization

We evaluate the performance of our network using the PCK metric, a standard measure that labels a keypoint detection as correct if it lies within a circle of radius α · L centered at the ground truth keypoint location, where L is the larger dimension of the image. Instead of the standard α = 0.1, we use a much stricter threshold of 0.05. Even though the network is trained on all keypoints irrespective of visibility, we report quantitative results on the visible set of keypoints to remain consistent with convention. It should be noted that the wireframes are constructed by joining all the predicted keypoint locations, including occluded ones, in a structural fashion. Figure 6 shows the ground truth and the predicted heatmaps from the network. We take the argmax over each of the 36 predicted heatmaps of all the joints and construct a wireframe by joining these points.
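The PCK computation itself is short; the sketch below evaluates it on the visible keypoints of a single image, with array names and shapes as illustrative assumptions:

```python
import numpy as np

def pck(pred, gt, visible, img_dim=64, alpha=0.05):
    """pred, gt: (K, 2) keypoint locations; visible: (K,) boolean mask;
    img_dim: the larger image dimension L."""
    dists = np.linalg.norm(pred - gt, axis=-1)        # per-keypoint error
    correct = (dists <= alpha * img_dim) & visible    # within radius alpha * L
    return correct.sum() / max(visible.sum(), 1)      # fraction correct
```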

B. Ablation Experiments

Given that our main contributions concern the network architecture, we explore the effect of adding recurrent cells by comparing 2D keypoint localization accuracy. For a fair comparison, we train all networks from scratch on the same rendered synthetic dataset, taking the standard hourglass network as the baseline. We train three different network architectures, including the baseline hourglass, and tabulate the PCK at α = 0.05 in Table I.

Fig. 6: Ground truth and predicted keypoint likelihood maps (heatmaps)

It is interesting to note that adding ConvGRU layers by itself improves the mean 2D keypoint accuracy from 94.8% to 96.1%, showing that the network has learned to model the motion dynamics between frames in a sequence. We also observe that further changing the RNN cells in the residual layers from ConvGRU to CoordConvGRU improves the result by another 0.6%, to 96.7%. This improvement can be attributed to the selective translational invariance of the layer, which enables it to capture the spatio-temporal consistencies in the sequence.

Network                              PCK @ α = 0.05
Standard Hourglass (Baseline)        94.81
Hourglass ConvGRU + Sequence         96.10
Hourglass CoordConvGRU + Sequence    96.71

TABLE I: PCK comparison between the baseline hourglass network and the proposed network architectures.

Network                   PCK @ α = 0.1
DDN [29]                  67.6
WN-gt-yaw [3]             88.0
Zia et al. [28]           73.6
DSN-2D                    27.2
DSN-3D                    76.0
plain-2D                  45.2
plain-3D                  88.4
DISCO-3D-2D               90.1
DISCO-vis-3D-2D           92.3
DISCO-Vgg                 83.5
DISCO [4]                 93.1
Standard Hourglass        82.11
Hourglass CoordConvGRU    88.27

TABLE II: PCK accuracies on the KITTI 3D dataset [28]

C. Results on KITTI images

We fine-tune the proposed architectures on 300 real images from KITTI. The networks are refined only on static images, not on sequential data.


Fig. 7: Wireframes predicted by the network on KITTI image sequences

Network                              PCK @ α = 0.1
Standard Hourglass                   82.33
Hourglass CoordConvGRU               87.25
Hourglass CoordConvGRU + Sequence    87.81

TABLE III: PCK comparison on annotated KITTI sequences

The qualitative results in Figure 7 show that our networks handle this shift in data distribution from synthetic to real quite effectively with minimal training. We also perform a quantitative comparison of our approach on the annotated KITTI [30] images. Table II reports PCK accuracies for various state-of-the-art methods, including Zia et al. [28], DDN [29], WarpNet [3] and several variants of DISCO [4]. Since the DISCO architecture is based on supervision for intermediate tasks, we also report PCK values for its variants depending on the supervision used. DISCO-vis-3D-2D, DISCO-3D-2D, plain-3D and plain-2D are networks with pose, pose+visibility, pose+visibility+2D and pose+visibility+3D, respectively, removed as supervisory inputs from the complete deep supervision required for DISCO. It should be noted that we provide only 2D keypoint supervision for training and still achieve better results than most of these networks using a much simpler architecture. The amount of supervision required to train our network is equivalent to that of plain-2D and DSN-2D, which perform significantly worse on the keypoint localization task, at PCKs of 45.2% and 27.2% respectively. This shows that our proposed architecture, at a PCK of 88.27%, has far better expressive capability than the other methods for this task. We also outperform the standard hourglass by almost 6.1% on static KITTI images from [28].

We also use our 524 labeled sequences of images from KITTI to evaluate the performance of the network on sequences. Table III tabulates the PCK values on this data. Again, our proposed network surpasses the baseline by a significant margin of around 6% and is 0.5% higher than the variant without sequence training. It should also be noted that the performance disparity is significant when compared to the synthetic data. This can be attributed to the better convergence behavior of the CoordConv [25] unit in the RNN cells. Since we fine-tune the network on a limited set of images, this behavior is not contrary to expectations.

D. Run-Time Analysis

While using RNNs on sequential data gives significant accuracy improvements, they come at a price. The run time of ConvGRU and CoordConvGRU is higher, as they perform more operations than a regular convolution layer. In addition, since the CoordConv layers add two more channels to each convolution layer inside the RNN cell, they further increase the forward pass time of the network. Our network has a much simpler architecture than the state-of-the-art DISCO [4], which is based on the very deep VGG network with fully connected layers for intermediate task supervision. This enables us to run our network in applications requiring real-time predictions. We evaluated the forward pass times of all three networks; the results are presented in Table IV. Since a forward pass requires only the image and the hidden states of the previous RNN cells, the run time is computed for a single frame and not the whole sequence.


Network                     Forward Pass Time
Standard Hourglass          11 ms
Hourglass ConvGRU           19 ms
Hourglass CoordConvGRU      21 ms

TABLE IV: Run time of different networks on an Nvidia 1080Ti GPU

V. CONCLUSIONS

We present a framework for keypoint localization on an object category that leverages the spatio-temporal correlation between frames in a sequence. The experimental results support our hypothesis and demonstrate that our network performs on par with other networks trained with the same level of supervision, while using a relatively simpler architecture. We also demonstrate that our network is fast enough for real-time applications like autonomous driving. We showcase a new type of RNN cell and conduct ablation experiments to highlight its performance benefits. The results from the proposed architecture could be further improved by adding multiple levels of supervision, which deserves further examination. We make our proposed network architecture and annotated KITTI sequences publicly available for critical analysis and extension of this work.

REFERENCES

[1] J. K. Murthy, S. Sharma, and K. M. Krishna, “Shape priors for real-time monocular object localization in dynamic environments,” in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017, pp. 1768–1774.

[2] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision. Springer, 2016, pp. 483–499.

[3] A. Kanazawa, D. W. Jacobs, and M. Chandraker, “Warpnet: Weakly supervised matching for single-view reconstruction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3253–3261.

[4] C. Li, M. Zeeshan Zia, Q.-H. Tran, X. Yu, G. D. Hager, and M. Chandraker, “Deep supervision with shape concepts for occlusion-aware 3d object parsing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5465–5474.

[5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al., “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.

[6] N. Ballas, L. Yao, C. Pal, and A. Courville, “Delving deeper into convolutional networks for learning video representations,” arXiv preprint arXiv:1511.06432, 2015.

[7] A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1653–1660.

[8] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a convolutional network and a graphical model for human pose estimation,” in Advances in Neural Information Processing Systems, 2014, pp. 1799–1807.

[9] V. Ramakrishna, D. Munoz, M. Hebert, J. A. Bagnell, and Y. Sheikh, “Pose machines: Articulated pose estimation via inference machines,” in European Conference on Computer Vision. Springer, 2014, pp. 33–47.

[10] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520–1528.

[11] J. Zhao, M. Mathieu, R. Goroshin, and Y. Lecun, “Stacked what-where auto-encoders,” arXiv preprint arXiv:1506.02351, 2015.

[12] K. Rematas, T. Ritschel, M. Fritz, E. Gavves, and T. Tuytelaars, “Deep reflectance maps,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4508–4516.

[13] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.

[14] G. Gkioxari, A. Toshev, and N. Jaitly, “Chained predictions using convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 728–743.

[15] A. Jain, J. Tompson, Y. LeCun, and C. Bregler, “Modeep: A deep learning framework using motion features for human pose estimation,” in Asian Conference on Computer Vision. Springer, 2014, pp. 302–315.

[16] M. Lin, L. Lin, X. Liang, K. Wang, and H. Cheng, “Recurrent 3d pose sequence machines,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 810–819.

[17] B. Xiaohan Nie, C. Xiong, and S.-C. Zhu, “Joint action recognition and pose estimation from video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1293–1301.

[18] T. Pfister, J. Charles, and A. Zisserman, “Flowing convnets for human pose estimation in videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1913–1921.

[19] J. Song, L. Wang, L. Van Gool, and O. Hilliges, “Thin-slicing network: A deep structured model for pose estimation in videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4220–4229.

[20] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, “Deepflow: Large displacement optical flow with deep matching,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1385–1392.

[21] Y. Luo, J. Ren, Z. Wang, W. Sun, J. Pan, J. Liu, J. Pang, and L. Lin, “Lstm pose machines,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5207–5215.

[22] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.

[23] M. Siam, S. Valipour, M. Jagersand, and N. Ray, “Convolutional gated recurrent networks for video segmentation,” in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 3090–3094.

[24] C. Payer, D. Stern, T. Neff, H. Bischof, and M. Urschler, “Instance segmentation and tracking with cosine embeddings and recurrent hourglass networks,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 3–11.

[25] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski, “An intriguing failing of convolutional neural networks and the coordconv solution,” arXiv preprint arXiv:1807.03247, 2018.

[26] C. Li, M. Z. Zia, Q.-H. Tran, X. Yu, G. D. Hager, and M. Chandraker, “Deep supervision with shape concepts for occlusion-aware 3d object parsing,” arXiv preprint arXiv:1612.02699, 2016.

[27] H. Su, C. R. Qi, Y. Li, and L. J. Guibas, “Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views,” in The IEEE International Conference on Computer Vision (ICCV), December 2015.

[28] M. Z. Zia, M. Stark, and K. Schindler, “Towards scene understanding with detailed 3d object representations,” International Journal of Computer Vision, vol. 112, no. 2, pp. 188–203, 2015.

[29] X. Yu, F. Zhou, and M. Chandraker, “Deep deformation network for object landmark localization,” in European Conference on Computer Vision. Springer, 2016, pp. 52–70.

[30] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.