
PackNet-SfM: 3D Packing for Self-Supervised Monocular Depth Estimation

Vitor Guizilini*   Rareș Ambruș*   Sudeep Pillai*   Adrien Gaidon
Toyota Research Institute (TRI)
[email protected]

Abstract

Densely estimating the depth of a scene from a single image is an ill-posed inverse problem that is seeing exciting progress with self-supervision from strong geometric cues, in particular from training using stereo imagery. In this work, we investigate the more challenging structure-from-motion (SfM) setting, learning purely from monocular videos. We propose PackNet, a novel deep architecture that leverages new 3D packing and unpacking blocks to effectively capture fine details in monocular depth map predictions. Additionally, we propose a novel velocity supervision loss that allows our model to predict metrically accurate depths, thus alleviating the need for test-time ground-truth scaling. We show that our proposed scale-aware architecture achieves state-of-the-art results on the KITTI benchmark, significantly improving upon any approach trained on monocular video, and even achieves performance competitive with stereo-trained methods. Video† and code‡ of our approach are made available at the links below.

1. Introduction

Perceiving depth is fundamental for many tasks such as perception, navigation, and planning. Depth from monocular imagery can provide useful cues for a wide array of tasks [27, 17, 25, 22]. However, learning monocular depth via direct supervision requires additional sensors and precise calibration. Self-supervised methods do not suffer from these limitations, as they use geometric constraints on the raw imagery as a source of supervision. In this work we address the problem of jointly estimating scene structure and camera motion across sequences of RGB images, commonly referred to as Structure-from-Motion (SfM).

While recent works in self-supervised monocular depth estimation have mostly focused on engineering the loss function [39, 44, 24, 3], we show that performance in this self-supervised SfM regime critically depends on the model architecture, in line with the observations of [20] for other self-supervised tasks.

* Authors contributed equally
† https://youtu.be/-N8QFtL3ees
‡ https://github.com/ToyotaResearchInstitute/packnet-sfm

Figure 1: Scale-Aware Monocular Depth with PackNet-SfM. We illustrate the strong qualitative performance of our scale-aware depth estimation method with PackNet trained purely on monocular data. Our method is able to predict accurate metric depth maps with fine details and crisp boundaries, as illustrated by the depth error map (w.r.t. ground-truth from LiDAR) in the 3rd row.

Going beyond general models like ResNet [15], our main contribution is a new convolutional network architecture, called PackNet, specifically designed for self-supervised monocular depth estimation. We propose new packing and unpacking blocks that are inspired by sub-pixel convolutions [31] and leverage 3D convolutions to maximally preserve dense appearance and geometric information. Our second contribution is a novel loss that can leverage the camera's velocity when available (e.g., for cars, robots, mobile phones) to solve the inherent scale ambiguity in monocular SfM.



Although optional, we show that this loss acts as a regularizer for the main SfM objective and allows our networks to recover metrically accurate depth and pose, even without median ground-truth scaling.

Using the standard KITTI benchmark [11], we show that our self-supervised learning approach, called PackNet-SfM, vastly improves over the state of the art for depth prediction trained in the monocular setting. Furthermore, PackNet-SfM is even competitive with methods self-supervised from stereo imagery, although the geometric information available in the SfM case is significantly weaker and noisier, thus confirming the robustness and effectiveness of our PackNet architecture, as qualitatively illustrated in Figure 1.

2. Related Work

Depth estimation from a single image poses several challenges due to its inherent ill-posed nature. However, modern ConvNets have dramatically changed the landscape of monocular depth estimation thanks to their ability to leverage appearance patterns in large-scale datasets.

Depth Network Architectures. Eigen et al. [9] proposed one of the earliest works in ConvNet-based depth estimation, using a multi-scale deep network trained on RGB-D sensor data to regress depth directly from single images. Subsequent works extended these network architectures to perform two-view stereo disparity estimation [26] using techniques developed in the flow estimation literature [8]. Following [8, 26], Ummenhofer et al. [33] applied these concepts to simultaneously train a depth and a pose network to predict depth and camera ego-motion between successive unconstrained image pairs. Independently, dense pixel-prediction networks [40, 2, 1, 23] have made tremendous progress towards improving the flow of information between layers during the encoding and decoding stages. Fractional pooling [14] was introduced as a way to amortize the rapid reduction in spatial dimensions during down-sampling. Lee et al. [21] generalized the pooling function to allow the learning of more complex and variable patterns, including linear combinations and learnable pooling operations. Shi et al. [31] used sub-pixel convolutions to perform Single-Image Super-Resolution (SISR), synthesizing and super-resolving images beyond their input resolutions while still operating at lower resolutions. Recent works [29, 42] in self-supervised monocular depth estimation use this concept to super-resolve their estimates and further improve their depth estimation performance.

Self-Supervised Monocular Depth and Pose. As supervised techniques for depth estimation advanced rapidly, obtaining target depth labels became a challenge, especially for outdoor applications.

To this end, [10, 13] provided an alternative strategy that trains a monocular depth network with stereo cameras, without requiring ground-truth depth labels. By leveraging Spatial Transformer Networks [16], Godard et al. [13] use stereo imagery to geometrically transform the right image, plus a predicted depth of the left image, into a synthesized left image. The loss between the resulting synthesized and original left images is then defined in a fully-differentiable manner, using a Structural Similarity [35] term and additional depth regularization terms, thus allowing the depth network to be self-supervised in an end-to-end fashion.

Following [13] and [33], Zhou et al. [43] generalize this to self-supervised training in the purely monocular setting, where a depth and a pose network are simultaneously learned from unlabeled monocular videos. Several methods [39, 24, 3, 44, 42, 19, 34, 37] have advanced this line of work by incorporating additional loss terms and constraints. All these methods, however, take advantage of constraints in monocular SfM that only allow the estimation of depth and pose up to an unknown scale factor, and rely on ground-truth LiDAR measurements to scale their depth estimates appropriately for evaluation purposes [43]. Instead, in this work we show that, by simply using the instantaneous velocity of the camera during training, we are able to learn a scale-aware depth and pose model, alleviating the impractical need to use LiDAR ground-truth depth measurements at test-time.

3. Self-Supervised Scale-Aware SfM

In the self-supervised monocular SfM setting, we aim to learn: (i) a monocular depth model f_D : I → D that predicts the scale-ambiguous depth D̂ = f_D(I(p)) for every pixel p in the target image I; and (ii) a monocular ego-motion estimator f_x : (I_t, I_S) → x_{t→S} that predicts the set of 6-DoF rigid transformations, for all s ∈ S, given by
\[ x_{t \to s} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \in \mathrm{SE}(3), \]
between the target image I_t and the set of source images I_s ∈ I_S considered as part of the temporal context. In practice, we use the frames I_{t−1} and I_{t+1} as source images, although using a larger context is possible. Note that in the case of monocular SfM both depth and pose are estimated up to an unknown scale factor.

3.1. Preliminaries

Following the work of Zhou et al. [43], we train the depth and pose networks simultaneously in a self-supervised manner. In this work, however, we learn to recover the inverse depth f_d : I → f_D^{-1}(I) instead, along with the ego-motion estimator f_x. Similar to [43], the overall self-supervised objective consists of an appearance matching loss term L_p, imposed between the synthesized target image Î_t and the original target image I_t, and a depth regularization term L_s that ensures edge-aware smoothing in the depth estimate D̂_t.

Page 3: PackNet-SfM: 3D Packing for Self-Supervised Monocular ... · PackNet-SfM: 3D Packing for Self-Supervised Monocular Depth Estimation Vitor Guizilini * Rares, Ambrus, * Sudeep Pillai

The objective takes the following form:

\[ L(I_t, \hat{I}_t) = L_p(I_t, \hat{I}_t) \odot M_t + \lambda_1 \, L_s(\hat{D}_t) \tag{1} \]

where M_t is a binary mask that avoids computing the photometric loss on pixels that do not have a valid mapping (i.e. pixels from the source image that do not project onto the target image given the estimated target depth), and λ_1 weights the depth regularization term in the objective. The overall loss in Equation 1 is averaged per pixel, pyramid scale, and image batch during training. Figure 2 shows a high-level overview of our training pipeline.
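For concreteness, the synthesized target image Î_t in Equation 1 is obtained by warping a source image into the target view using the predicted depth, the predicted ego-motion and the camera intrinsics, as in [43]. Below is a minimal PyTorch sketch of this warping step under a pinhole camera model; the function name, tensor layout and the handling of invalid projections are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(I_s, D_t, T_t2s, K):
    """Synthesize the target image by sampling the source image I_s.

    I_s:   (B,3,H,W) source image
    D_t:   (B,1,H,W) predicted target depth
    T_t2s: (B,4,4)   rigid transform from the target to the source frame
    K:     (B,3,3)   pinhole camera intrinsics
    Returns the warped image and a binary mask of valid projections (M_t).
    """
    B, _, H, W = D_t.shape
    device = D_t.device

    # Pixel grid in homogeneous coordinates, shape (3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().view(3, -1)

    # Back-project target pixels to 3D: X = D * K^-1 * p
    cam = torch.inverse(K) @ pix.unsqueeze(0).expand(B, -1, -1)   # (B,3,H*W)
    cam = cam * D_t.view(B, 1, -1)

    # Rigidly transform into the source frame and project with K
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    src = (T_t2s @ cam_h)[:, :3]
    uv = K @ src
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)

    # Normalize coordinates to [-1, 1] for grid_sample; build the valid mask
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    valid = ((u.abs() <= 1) & (v.abs() <= 1)).view(B, 1, H, W)

    I_hat = F.grid_sample(I_s, grid, padding_mode="zeros", align_corners=True)
    return I_hat, valid
```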

Appearance Matching Loss. Following [13, 43], the pixel-level similarity between the target image I_t and the synthesized target image Î_t is estimated using a Structural Similarity (SSIM) [35] term combined with an L1 pixel-wise loss term, inducing the overall photometric loss given by Equation 2 below.

\[ L_p(I_t, \hat{I}_t) = \alpha \, \frac{1 - \mathrm{SSIM}(I_t, \hat{I}_t)}{2} + (1 - \alpha) \, \lVert I_t - \hat{I}_t \rVert \tag{2} \]

While multi-view projective geometry provides strong cues for self-supervision, errors due to parallax or dynamic objects in the scene have an undesirable effect on the photometric loss. Following [42], we treat such regions in the image as outliers and clip the photometric loss values that are more than 0.5σ above the mean photometric error (i.e. q ≈ 38th percentile). This robustifies the photometric loss and subsequently improves the optimization process.
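Below is a minimal PyTorch sketch of the photometric loss of Equation 2 with the outlier clipping described above; the 3×3 SSIM window and the helper names are assumptions, and α = 0.85 follows Section 5.2.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """SSIM over 3x3 neighborhoods (window size is an assumption)."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return (num / den).clamp(0, 1)

def photometric_loss(I_t, I_hat, alpha=0.85, clip_sigma=0.5):
    """Per-pixel photometric error of Eq. (2), with outlier clipping."""
    l1 = (I_t - I_hat).abs().mean(1, keepdim=True)
    dssim = ((1.0 - ssim(I_t, I_hat)) / 2.0).mean(1, keepdim=True)
    loss = alpha * dssim + (1.0 - alpha) * l1
    # Clip values more than clip_sigma standard deviations above the mean error
    threshold = loss.mean() + clip_sigma * loss.std()
    return torch.clamp(loss, max=threshold.item())
```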

Depth Smoothness Loss. In order to regularize the depth in texture-less, low-gradient image regions, we incorporate an edge-aware term (Equation 3), similar to [13]. The loss is weighted for each of the pyramid levels and is decayed by a factor of 2 on down-sampling, starting with a weight of 1 for the 0th pyramid level.

\[ L_s(\hat{D}_t) = \lvert \delta_x \hat{D}_t \rvert \, e^{-\lvert \delta_x I_t \rvert} + \lvert \delta_y \hat{D}_t \rvert \, e^{-\lvert \delta_y I_t \rvert} \tag{3} \]
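A minimal sketch of this edge-aware smoothness term, assuming it is applied to the predicted inverse depth (disparity) map; the per-scale weight decay described above would be applied outside this helper.

```python
import torch

def smoothness_loss(disp, img):
    """Edge-aware smoothness of Eq. (3): disp (B,1,H,W), img (B,3,H,W)."""
    d_disp_x = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
    d_disp_y = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()
    d_img_x = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    d_img_y = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)
    # Down-weight depth gradients wherever the image itself has strong gradients
    return (d_disp_x * torch.exp(-d_img_x)).mean() + \
           (d_disp_y * torch.exp(-d_img_y)).mean()
```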

3.2. Scale-Aware SfM

As previously mentioned, both the monocular depth and ego-motion estimators f_d and f_x predict scale-ambiguous values, due to the limitations of the monocular SfM objective. In other words, the scene depth and the camera ego-motion can only be estimated up to an unknown and ambiguous scale factor. This is also reflected in the overall learning objective, where the photometric loss is agnostic to the metric depth of the scene. Furthermore, we note that all previous approaches which operate in the self-supervised monocular regime [13, 10, 24, 3] suffer from this limitation, and resort to artificially incorporating this scale factor at test-time for evaluation purposes, using LiDAR ground-truth measurements.

Figure 2: PackNet-SfM: Our proposed scale-aware self-supervised monocular SfM network architecture. We use PackNet as the depth network, along with a pose network whose translation component is constrained by an additional velocity supervision loss. The velocity supervision gradually informs the metric scale in the pose network, which is eventually propagated to the depth network so as to predict metric depths.

Velocity Supervision Loss. Since instantaneous velocity measurements are ubiquitous in most mobile systems today, we show that they can be directly incorporated into our self-supervised objective to learn a metrically accurate and scale-aware monocular depth estimator. During training, we impose an additional loss L_v between the magnitude of the translation component t of the pose network prediction and the measured instantaneous velocity scalar v multiplied by the time difference ΔT_{t→s} between target and source frames, as shown in Equation 4:

\[ L_v(t_{t \to s}, v) = \Big\lvert \, \lVert t_{t \to s} \rVert - \lvert v \rvert \, \Delta T_{t \to s} \, \Big\rvert \tag{4} \]

Our final scale-aware self-supervised objective L_scale, extending Equation 1, becomes:

\[ L_{scale}(I_t, \hat{I}_t, v) = L(I_t, \hat{I}_t) + \lambda_2 \, L_v(t_{t \to s}, v) \tag{5} \]

This additional velocity loss allows the pose network to learn metrically accurate pose estimates, subsequently resulting in the depth network learning metrically scaled depth estimates. λ_2 is a weight used to balance the different loss terms. In Section 5.4, we present experiments with both the scale-ambiguous and scale-aware variants of the networks to highlight their differences.
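The velocity term of Equation 4 and the combined objective of Equation 5 reduce to a few lines; the sketch below uses the weights reported in Section 5.2, and the tensor shapes are assumptions.

```python
import torch

def velocity_loss(translation, speed, dt):
    """Velocity supervision of Eq. (4).

    translation: (B,3) predicted pose-network translation t_{t->s}
    speed:       (B,)  measured instantaneous speed |v| in m/s
    dt:          (B,)  time difference between target and source frames in s
    """
    return (translation.norm(dim=-1) - speed.abs() * dt).abs().mean()

def scale_aware_loss(photometric, smoothness, velocity, lambda1=0.1, lambda2=0.01):
    """Scale-aware objective of Eq. (5), combining Eqs. (1) and (4)."""
    return photometric + lambda1 * smoothness + lambda2 * velocity
```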

4. PackNet: 3D Packing for Depth Estimation

In order to increase the receptive field, it is common practice in standard ConvNet architectures to aggressively stride and pool in the convolutional layers, particularly at early stages, to decrease computational complexity.


Figure 3: Proposed packing (a) and unpacking (b) blocks. Packing effectively replaces convolutional striding and pooling in the encoder, while unpacking provides a novel feature upsampling mechanism in the decoder.

However, these architectural design choices have been shown to decrease model performance [14, 41] due to their lossy nature. Similarly, traditional up-sampling strategies [7, 4] fail to propagate and preserve sufficient detail at the decoder layers to recover accurate depth predictions. Therefore, we propose a novel deep architecture, called PackNet, that introduces new 3D packing and unpacking blocks to effectively preserve more information during the encoding and decoding processes in a CNN-based monocular depth encoder-decoder architecture. We first describe the different blocks of our architecture, then how they are integrated together into a single model.

4.1. Packing Block

The packing block (Figure 3a) folds the spatial dimensions of intermediate convolutional feature maps into extra channels, thus capturing sub-pixel information across the various encoding layers. The packing block effectively replaces 2D convolutional striding and pooling with a Space2Depth [31] operation followed by a 3D convolutional layer that aggregates the packed feature maps and processes spatial information after it has been folded into extra channels. This is achieved by first reshaping the input tensor to produce a new 1-channel dimension, which after the operation is concatenated back with the original channels. A final 2D convolutional layer is then used to reduce the aggregated convolutional feature maps to the desired number of output channels.
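The following PyTorch sketch illustrates one way to realize such a packing block, using nn.PixelUnshuffle as the Space2Depth operation; the channel counts, kernel sizes, and the exact way the 3D features are folded back into 2D channels are assumptions rather than the authors' exact configuration.

```python
import torch.nn as nn

class PackingBlock(nn.Module):
    """Sketch of a packing block: Space2Depth -> 3D conv -> 2D conv."""

    def __init__(self, in_ch, out_ch, r=2, d=8, kernel_size=3):
        super().__init__()
        self.space2depth = nn.PixelUnshuffle(r)        # folds space into channels
        self.conv3d = nn.Conv3d(1, d, kernel_size=3, padding=1)
        self.conv2d = nn.Conv2d(in_ch * r * r * d, out_ch,
                                kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        x = self.space2depth(x)        # (B, C*r^2, H/r, W/r)
        x = x.unsqueeze(1)             # add a unit channel dimension for the 3D conv
        x = self.conv3d(x)             # (B, d, C*r^2, H/r, W/r)
        b, d, c, h, w = x.shape
        x = x.view(b, d * c, h, w)     # fold 3D features back into 2D channels
        return self.conv2d(x)          # reduce to the desired number of channels
```

For example, PackingBlock(64, 64) halves the spatial resolution while keeping 64 output channels, as in the packing layers of Table 1.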

4.2. Unpacking Block

Similarly, the unpacking block (Figure 3b) unfolds convolutional feature channels back into spatial dimensions during the decoding process, thereby leveraging the concept of sub-pixel convolutions [31] for effective and detail-preserving decoding. The unpacking block effectively replaces convolutional feature up-sampling, typically performed via a nearest-neighbor strategy or with learnable transposed convolutional weights. In this block, we combine the sub-pixel convolution (Depth2Space) [31] operation with an additional 3D convolutional operation, to further aggregate information across the decoding layers. In contrast to the packing block, 2D convolutions are used to produce the required number of feature channels for the following 3D convolutional layer. This allows the block to fully exploit packed spatial information by promoting feature aggregation across all three dimensions, via the use of 3D convolutional kernels. The resulting feature maps are finally super-resolved to the target dimension of the subsequent layers.

# | Layer Description | K | D | Output Tensor Dim.
0 | Input RGB image | - | - | 3×H×W
Encoding Layers
1 | Conv2d | 5 | - | 64×H×W
2 | Conv2d | 7 | - | 64×H×W
3 | Packing | 3 | - | 64×H/2×W/2
4 | ResidualBlock (x2) | 5 | - | 64×H/2×W/2
5 | Packing | 3 | 8 | 64×H/4×W/4
6 | ResidualBlock (x2) | 3 | - | 128×H/4×W/4
7 | Packing | 3 | 8 | 128×H/8×W/8
8 | ResidualBlock (x3) | 3 | - | 256×H/8×W/8
9 | Packing | 3 | 8 | 256×H/16×W/16
10 | ResidualBlock (x3) | 3 | - | 512×H/16×W/16
11 | Packing | 3 | 8 | 512×H/32×W/32
Decoding Layers
12 | Unpacking | 3 | 8 | 512×H/16×W/16
13 | Conv2d (9+12) | 3 | - | 512×H/16×W/16
14 | Unpacking | 3 | 8 | 256×H/8×W/8
15 | Conv2d (7+14) | 3 | - | 256×H/8×W/8
16 | InvDepth (15) | - | - | 4×H/8×W/8
17 | Unpacking | 3 | 8 | 128×H/4×W/4
18 | Conv2d (5+17+Upsample(16)) | 3 | - | 128×H/4×W/4
19 | InvDepth (18) | - | - | 4×H/4×W/4
20 | Unpacking | 3 | 8 | 64×H/2×W/2
21 | Conv2d (3+20+Upsample(19)) | 3 | - | 64×H/2×W/2
22 | InvDepth (21) | - | - | 4×H/2×W/2
23 | Unpacking | 3 | 8 | 64×H×W
24 | Conv2d (0+23+Upsample(22)) | 3 | - | 64×H×W
25 | InvDepth (24) | - | - | 4×H×W
Inverse Depth Output Scales
#4 | Depth2Space (16) | - | - | 1×H/4×W/4
#3 | Depth2Space (19) | - | - | 1×H/2×W/2
#2 | Depth2Space (22) | - | - | 1×H×W
#1 | Depth2Space (25) | - | - | 1×2H×2W

Table 1: Summary of the PackNet architecture for self-supervised monocular depth estimation. The Packing and Unpacking blocks are described in Figure 3. Conv2d blocks include GroupNorm [36] with G = 16 and ELU non-linearities. InvDepth blocks include a 2D convolutional layer with K = 3 and sigmoid non-linearities. Each ResidualBlock is comprised of a sequence of three 2D convolutional layers with K = 3/3/1 and ELU non-linearities, followed by GroupNorm with G = 16 and dropout [32] of 0.5 in the final layer. Upsample is a 2-factor resizing operation with nearest-neighbor interpolation. Numbers in parentheses indicate input layers, with + as concatenation in the channel dimension.
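Analogously, the following sketch illustrates the unpacking block of Section 4.2 (2D convolution, 3D convolution, then Depth2Space via nn.PixelShuffle); as with the packing sketch above, the channel bookkeeping is an assumption and requires out_ch * r^2 to be divisible by d.

```python
import torch.nn as nn

class UnpackingBlock(nn.Module):
    """Sketch of an unpacking block: 2D conv -> 3D conv -> Depth2Space."""

    def __init__(self, in_ch, out_ch, r=2, d=8, kernel_size=3):
        super().__init__()
        # Produce exactly the channels needed so Depth2Space yields out_ch maps
        self.conv2d = nn.Conv2d(in_ch, out_ch * r * r // d,
                                kernel_size, padding=kernel_size // 2)
        self.conv3d = nn.Conv3d(1, d, kernel_size=3, padding=1)
        self.depth2space = nn.PixelShuffle(r)          # unfolds channels into space

    def forward(self, x):
        x = self.conv2d(x)             # (B, out_ch*r^2/d, H, W)
        x = x.unsqueeze(1)
        x = self.conv3d(x)             # (B, d, out_ch*r^2/d, H, W)
        b, d, c, h, w = x.shape
        x = x.view(b, d * c, h, w)     # (B, out_ch*r^2, H, W)
        return self.depth2space(x)     # (B, out_ch, r*H, r*W)
```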

4.3. Model Architecture

Our PackNet architecture for self-supervised monocular depth estimation is detailed in Table 1. Our encoder-decoder architecture incorporates several packing and unpacking blocks in the encoder and decoder, respectively. Inspired by previous disparity network architectures [26], we supplement the encoding and decoding layers with skip connections, further facilitating the flow of information and gradients throughout the network. The decoder produces intermediate inverse depth maps, which are upsampled before being concatenated with their corresponding skip connections and unpacked feature maps. This enables super-resolving representations incrementally generated between layers, as described next.

4.4. Super-Resolved Inverse Depth Maps

We use the same fundamental concept of single-image super-resolution [31] to super-resolve the intermediate depth outputs across all 4 pyramid scales via super-resolution layers (Depth2Space layers in Table 1). Interestingly, recent works in monocular depth estimation [29, 42] have leveraged sub-pixel convolutions to up-sample depth predictions and improve overall performance.

5. Experiments

5.1. Datasets

KITTI. We use the KITTI dataset [11] for all our depth benchmarks and evaluation. More specifically, we adopt the training protocol of Eigen et al. [9] and use the KITTI Eigen splits [9], which contain 22600 training, 888 validation, and 697 test monocular images. Training sequences are generated using a stride of 1 with backward and forward contexts of also 1, meaning that the immediately previous (t−1), current (t) and posterior (t+1) images are used together as input for the monocular self-supervised learning of depth and pose. Additionally, we perform speed filtering with a threshold of 2 m/s to remove image contexts that observe little camera ego-motion, since depth cannot be learned under these circumstances.
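One plausible way to implement this context generation and speed filtering is sketched below; whether the 2 m/s threshold is applied to the target frame only or to the whole context is not specified, so requiring all three frames to exceed it is an assumption.

```python
def make_training_triplets(image_paths, speeds, min_speed=2.0):
    """Build (t-1, t, t+1) training contexts, dropping near-static ones.

    image_paths: ordered list of frame paths for one sequence
    speeds:      per-frame speed measurements in m/s
    """
    triplets = []
    for t in range(1, len(image_paths) - 1):
        # Skip contexts with too little camera ego-motion (Section 5.1)
        if min(speeds[t - 1], speeds[t], speeds[t + 1]) < min_speed:
            continue
        triplets.append((image_paths[t - 1], image_paths[t], image_paths[t + 1]))
    return triplets
```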

CityScapes. In addition to training on the KITTI dataset, we also experiment with pre-training the monocular depth and pose networks on the CityScapes dataset [5] before fine-tuning on the KITTI dataset. This also allows us to explore the true motivation of self-supervised learning, i.e. how the performance of our model behaves when trained with increasing amounts of unlabeled data. The size of the CityScapes training split is 88250 files.

5.2. Implementation Details

We use PyTorch [28] for all our experiments, with the models trained across 8 Titan V100 GPUs. We use the Adam optimizer [18] with β1 = 0.9 and β2 = 0.999. The monocular depth and pose networks are trained for 200 epochs, with a batch size of 4 and initial depth and pose learning rates of 0.0002 and 0.0005, respectively. As training proceeds, the learning rate is decayed every 80 epochs by a factor of 2. We set the SSIM weight to α = 0.85, the depth regularization weight to λ1 = 0.1 and, where applicable, the velocity-scaling weight to λ2 = 0.01.
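A sketch of the corresponding optimization setup with these hyper-parameters is given below; depth_net, pose_net, train_loader and compute_total_loss (Equation 5) are hypothetical placeholders supplied by the surrounding code.

```python
import torch

def train(depth_net, pose_net, train_loader, compute_total_loss, epochs=200):
    """Training loop sketch using the hyper-parameters of Section 5.2."""
    optimizer = torch.optim.Adam([
        {"params": depth_net.parameters(), "lr": 2e-4},   # depth learning rate
        {"params": pose_net.parameters(), "lr": 5e-4},    # pose learning rate
    ], betas=(0.9, 0.999))
    # Decay both learning rates by a factor of 2 every 80 epochs
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=80, gamma=0.5)

    for _ in range(epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = compute_total_loss(batch)   # Eq. (5)
            loss.backward()
            optimizer.step()
        scheduler.step()
```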

Depth Network. For monocular depth estimation, we use the proposed PackNet architecture as specified in Table 1, with the super-resolved inverse depth regression layers described in Section 4.4. We use Group Normalization [36] throughout all implementations of our models.

Pose Network. For camera ego-motion estimation, we use the neural network architecture proposed by [43], without the explainability mask, which we found not to improve results. Following [43], the pose network consists of 7 convolutional layers followed by a final 1 × 1 convolutional layer. The input to the network consists of the target view I_t and the context views I_S, and the output is the set of 6-DoF transformations between I_t and I_s, for s ∈ S. To represent rotations, we use the Euler angle parameterization.
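A PyTorch sketch of such a pose network is shown below; the channel widths and the small output scaling are assumptions in the spirit of [43], not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """Sketch of a [43]-style pose network: 7 conv layers + a final 1x1 conv."""

    def __init__(self, num_source=2):
        super().__init__()
        self.num_source = num_source
        chans = [3 * (1 + num_source), 16, 32, 64, 128, 256, 256, 256]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.encoder = nn.Sequential(*layers)
        self.pose_pred = nn.Conv2d(chans[-1], 6 * num_source, 1)

    def forward(self, target, sources):
        # Concatenate the target and source views along the channel dimension
        x = torch.cat([target] + list(sources), dim=1)
        x = self.pose_pred(self.encoder(x))
        x = x.mean(dim=(2, 3))                        # global average pooling
        # Small scaling encourages near-identity initial motions (assumption)
        return 0.01 * x.view(-1, self.num_source, 6)  # 3 Euler angles + 3 translations
```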

5.3. Depth Estimation Performance

We evaluate the estimated depths using the metrics described in Eigen et al. [9]. We summarize our results in Table 2 and illustrate their performance qualitatively in Figures 5 and 6. We show that the proposed PackNet-SfM architecture significantly outperforms previous methods and establishes a new state of the art for monocular depth estimation trained in the self-supervised monocular setting. Furthermore, as evidenced by previous works [13], we show that by simply augmenting the target KITTI dataset with an additional source of unlabeled videos, such as the publicly available CityScapes dataset [5] (CS+K), we are able to further improve monocular depth estimation performance. In contrast to previous state-of-the-art methods [3, 12, 38], which predominantly focus on modifying the monocular Structure-from-Motion objective, we show that the proposed PackNet architecture can further bolster and establish a strong monocular depth estimation baseline for future developments.
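For reference, the metrics reported in the tables below can be computed as in the following sketch, assuming NumPy arrays of ground-truth and predicted depths restricted to valid pixels and capped at 80 m:

```python
import numpy as np

def eigen_depth_metrics(gt, pred):
    """Standard depth metrics of [9] on valid, capped ground-truth pixels."""
    thresh = np.maximum(gt / pred, pred / gt)
    d1, d2, d3 = [(thresh < 1.25 ** i).mean() for i in (1, 2, 3)]
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log, d1, d2, d3
```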

As indicated by Pillai et al. [29], we also observe an improvement in depth performance at higher image resolution. Interestingly, the addition of more data (e.g. CS+K) for the lower-resolution model allows us to achieve similar and even better results for some of the metrics as when processing data at higher resolutions.


Method | Supervision | Resolution | Dataset | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³
Godard et al. [12] | S | 640 x 192 | K | 0.115 | 1.010 | 5.164 | 0.212 | 0.858 | 0.946 | 0.974
Godard et al. [13] | S | 640 x 192 | CS + K | 0.114 | 0.898 | 4.935 | 0.206 | 0.861 | 0.949 | 0.976
SuperDepth [29] | S | 1024 x 384 | K | 0.112 | 0.875 | 4.958 | 0.207 | 0.852 | 0.947 | 0.977
3Net [30] | S | 512 x 256 | CS + K | 0.111 | 0.849 | 4.822 | 0.202 | 0.865 | 0.952 | 0.978
SfMLearner [43] | M | 416 x 128 | CS + K | 0.198 | 1.836 | 6.565 | 0.275 | 0.718 | 0.901 | 0.960
Klodt et al. [19] | M | 416 x 128 | CS + K | 0.165 | 1.340 | 5.764 | - | 0.784 | 0.927 | 0.970
Vid2Depth [24] | M | 416 x 128 | CS + K | 0.159 | 1.231 | 5.912 | 0.243 | 0.784 | 0.923 | 0.970
DF-Net [44] | M | 576 x 160 | CS + K | 0.146 | 1.182 | 5.215 | 0.213 | 0.818 | 0.943 | 0.978
Struct2Depth [3] (Motion) | M | 416 x 128 | K | 0.141 | 1.026 | 5.291 | 0.215 | 0.816 | 0.945 | 0.979
Yang et al. [38] (Motion) | M | 832 x 256 | K | 0.131 | 1.254 | 6.117 | 0.220 | 0.826 | 0.931 | 0.973
Godard et al. [12] (ImageNet) | M | 640 x 192 | K | 0.129 | 1.112 | 5.180 | 0.205 | 0.851 | 0.952 | 0.978
PackNet-SfM | M | 640 x 192 | K | 0.120 | 1.018 | 5.136 | 0.198 | 0.865 | 0.955 | 0.980
PackNet-SfM | M+v | 640 x 192 | K | 0.120 | 0.892 | 4.898 | 0.196 | 0.864 | 0.954 | 0.980
PackNet-SfM | M | 640 x 192 | CS + K | 0.117 | 1.263 | 5.144 | 0.195 | 0.874 | 0.957 | 0.981
PackNet-SfM | M | 1280 x 384 | K | 0.118 | 0.930 | 4.845 | 0.194 | 0.868 | 0.956 | 0.981
PackNet-SfM | M+v | 1280 x 384 | K | 0.113 | 0.885 | 4.725 | 0.190 | 0.883 | 0.960 | 0.981

Table 2: Depth estimation performance of PackNet-SfM on the KITTI dataset. Single-view depth estimation results on the KITTI dataset [11] using the Eigen split [9] for depths reported less than 80 m, as indicated in [9]. For Abs Rel, Sq Rel, RMSE, and RMSE log, lower is better. For δ < 1.25, δ < 1.25² and δ < 1.25³, higher is better. In the Dataset column, CS+K refers to pre-training on CityScapes (CS) and fine-tuning on KITTI (K). M and S refer to methods that train using monocular (M) and stereo (S) images, respectively. M+v refers to velocity supervision (v) in addition to monocular images (M). The Motion tag indicates the use of motion models to compensate for dynamic objects, and ImageNet indicates pre-training on said dataset [6]. At test-time, all monocular (M) methods scale the estimated depths using median ground-truth LiDAR depth.

We attribute this to PackNet's ability to preserve and properly process spatial information between layers, which allows the model to learn richer features that scale with data.

5.4. Scale-Aware Depth Estimation Performance

Previous methods in the self-supervised setting evaluate depth by scaling their estimates to the median ground-truth depth measured via LiDAR (i.e. Velodyne, in the case of the KITTI dataset). In Section 3.2 we propose to also recover the metric scale of the scene from a single image, by simply measuring camera ego-motion and imposing a loss on the magnitude of the translation output by the pose network.

Figure 4: Learning scale-aware depth from velocity supervision. Plots of the Abs. Rel. and δ < 1.25 depth estimation metrics for a monocular model trained with velocity supervision (M+v), after each training epoch. Note that initially an unscaled model is learned (black line), which requires ground-truth (GT) scaling for a proper evaluation. Once it stabilizes, scale-awareness gradually starts to be incorporated (red line), until eventually both ground-truth-scaled and velocity-scaled estimates reach similar values. (Curves: M with GT scaling and M+v without GT scaling, plotted against the number of KITTI training epochs.)

In Table 3 we report the PackNet-SfM variant that is able to encode the metric scale of the scene and therefore does not require ground-truth scaling during evaluation. Interestingly, we note that the addition of the velocity supervision loss term allows the network to learn metrically scaled depth without degrading the quality of the solution. In Figure 4, we show how the metric scale of the depth predictions is gradually learned during training.
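For completeness, the median ground-truth scaling applied to scale-ambiguous (M) models at test-time amounts to the following one-liner, which the scale-aware (M+v) model can skip:

```python
import numpy as np

def median_scale(pred, gt):
    """Scale a scale-ambiguous prediction to the median ground-truth depth."""
    return pred * (np.median(gt) / np.median(pred))
```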

5.5. Ablation Studies

To further study the performance improvements that PackNet-SfM provides, we perform an ablative analysis on the different architectural components introduced. We ablate the following components and report their corresponding depth performance in Table 3: (i) Without 3D Conv.: the 3D convolutional layer is removed from both the packing and unpacking blocks (Section 4); (ii) Without SR: the super-resolved inverse depth regression module (Section 4.4) is removed and the network directly outputs single-channel inverse depth estimates; (iii) Without Velocity-Scaling (Vel.): the velocity supervision loss (Section 3.2) is removed, and the resulting model is no longer scale-aware; and (iv) Without Ground-Truth Scaling (GT): the predicted estimates are no longer scaled at test-time using median ground-truth Velodyne depth values. We emphasize that this last setting is only viable when PackNet-SfM is trained with velocity supervision (M+v), since it is then able to predict scale-aware depth from a single image.


Figure 5: Qualitative depth estimation performance (columns: input image, PackNet-SfM, Vid2Depth [24], SfMLearner [43], DF-Net [44]). We illustrate the qualitative depth map predictions from a single RGB image produced by our proposed PackNet monocular depth architecture, and compare against previous methods, on the KITTI dataset (Eigen test split). Notably, our method is able to capture fine details and structure in the image (including vehicles, pedestrians, and thin poles), while other methods tend to blur out their predictions. We hypothesize that this improvement is due to the preservation and proper processing of spatial information between layers, during both the encoding and decoding stages.

Model Variant | Vel. | GT | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³
PackNet-SfM w/o 3D Conv. and SR | | ✓ | 0.128 | 1.102 | 5.286 | 0.207 | 0.852 | 0.949 | 0.977
PackNet-SfM w/o 3D Conv. | | ✓ | 0.126 | 0.961 | 5.071 | 0.205 | 0.853 | 0.950 | 0.979
PackNet-SfM w/o SR | | ✓ | 0.125 | 1.074 | 5.137 | 0.204 | 0.855 | 0.951 | 0.979
PackNet-SfM | | ✓ | 0.120 | 1.018 | 5.136 | 0.198 | 0.865 | 0.955 | 0.980
PackNet-SfM | ✓ | | 0.118 | 0.915 | 5.035 | 0.209 | 0.847 | 0.947 | 0.977
PackNet-SfM | ✓ | ✓ | 0.120 | 0.892 | 4.898 | 0.196 | 0.864 | 0.954 | 0.980

Table 3: Ablation studies on the PackNet architecture. The Vel. column indicates that the velocity loss (Equation 4) was used during training. The GT column indicates that ground-truth scaling was applied at test-time; 3D Conv. indicates the addition of a 3D convolutional layer to the packing and unpacking blocks (Section 4.3); and SR indicates the use of super-resolution on the output inverse depth maps (Section 4.4). All evaluations are performed on the KITTI dataset at a resolution of 640 x 192.

We show that the base PackNet architecture, with packing and unpacking blocks but without 3D Conv. and SR, already produces a strong baseline for the monocular depth estimation task. The inclusion of both 3D convolutions and super-resolution independently bolsters overall depth estimation performance further, with the new state-of-the-art results being achieved by the complete architecture, as proposed in this work. Additionally, we show that, by further incorporating velocity supervision (Vel.) into the overall PackNet-SfM architecture, we can achieve similar state-of-the-art depth estimation results without relying on ground-truth median depth scaling (GT) during evaluation.


Figure 6: High-resolution qualitative results of PackNet on the KITTI (left) and CityScapes (right) datasets. We illustrate the fine depth detail and crisp boundaries estimated by our proposed PackNet model on a diverse set of scenarios. We show that our model is able to achieve state-of-the-art performance while providing strong generalization, which is equally critical for driving-related tasks.

5.6. Qualitative Results

Qualitative results of the proposed PackNet-SfM approach are shown in Figure 5 and compared with related methods from the literature. We note that our method is able to capture fine details in the image, such as people or road signs, even at long distances, and that its depth estimates are overall significantly sharper. Figure 6 shows additional qualitative results on high-resolution images from the KITTI and CityScapes datasets, where we highlight details such as fences or traffic cones, which are successfully reconstructed by our method. The ability to capture this information with such a high degree of fidelity is a strong indicator of PackNet-SfM's usefulness in any system where scene depth is an important cue.

6. Conclusion

We propose a novel depth estimation network architecture, called PackNet, that leverages packing and unpacking blocks in order to preserve and effectively process the spatial information contained in images. Through extensive experiments, we show that our network architecture establishes a new state of the art in self-supervised monocular depth estimation on the publicly available KITTI dataset. Additionally, we show that by imposing an additional velocity scaling loss during training, we are able to recover scale-aware depth from a single RGB image at test time, thus relieving the need for post-hoc ground-truth depth scaling. Moreover, we show that velocity scaling also acts as a model regularizer that further improves the overall metrics on the KITTI depth prediction benchmark.


References
[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
[2] A. Bansal, X. Chen, B. Russell, A. Gupta, and D. Ramanan. PixelNet: Representation of the pixels, by the pixels, and for the pixels. arXiv preprint arXiv:1702.06506, 2017.
[3] V. Casser, S. Pirk, R. Mahjourian, and A. Angelova. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. 2019.
[4] Y. Chen and T. Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:1256–1272, 2017.
[5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[7] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, Feb. 2016.
[8] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
[9] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
[10] R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, pages 740–756. Springer, 2016.
[11] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[12] C. Godard, O. Mac Aodha, and G. Brostow. Digging into self-supervised monocular depth estimation. arXiv preprint arXiv:1806.01260, 2018.
[13] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, volume 2, page 7, 2017.
[14] B. Graham. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2015.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[16] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
[17] A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7482–7491, 2018.
[18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[19] M. Klodt and A. Vedaldi. Supervising the new with the old: Learning SfM from SfM. In European Conference on Computer Vision, pages 713–728. Springer, 2018.
[20] A. Kolesnikov, X. Zhai, and L. Beyer. Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005, 2019.
[21] C.-Y. Lee, P. Gallagher, and Z. Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
[22] K.-H. Lee, G. Ros, J. Li, and A. Gaidon. SPIGAN: Privileged adversarial learning from simulation. In ICLR, 2019.
[23] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[24] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5667–5675, 2018.
[25] F. Manhardt, W. Kehl, and A. Gaidon. ROI-10D: Monocular lifting of 2D detection to 6D pose and metric shape. arXiv preprint arXiv:1812.02781, 2018.
[26] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
[27] J. Michels, A. Saxena, and A. Y. Ng. High speed obstacle avoidance using monocular vision and reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 593–600. ACM, 2005.
[28] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[29] S. Pillai, R. Ambrus, and A. Gaidon. SuperDepth: Self-supervised, super-resolved monocular depth estimation. arXiv preprint arXiv:1810.01849, 2018.
[30] M. Poggi, F. Tosi, and S. Mattoccia. Learning monocular depth estimation with unsupervised trinocular assumptions. In 6th International Conference on 3D Vision (3DV), 2018.
[31] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
[32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[33] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 5, page 6, 2017.
[34] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2022–2030, 2018.
[35] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[36] Y. Wu and K. He. Group normalization. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII, pages 3–19, 2018.
[37] N. Yang, R. Wang, J. Stuckler, and D. Cremers. Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. arXiv preprint arXiv:1807.02570, 2018.
[38] Z. Yang, P. Wang, Y. Wang, W. Xu, and R. Nevatia. Every pixel counts: Unsupervised geometry learning with holistic 3D motion understanding. In European Conference on Computer Vision, pages 691–709. Springer, 2018.
[39] Z. Yin and J. Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2018.
[40] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[41] H. Zhang and J. Ma. Hartley spectral pooling for deep learning. Computing Research Repository, abs/1810.04028, 2018.
[42] L. Zhou, J. Ye, M. Abello, S. Wang, and M. Kaess. Unsupervised learning of monocular depth estimation with bundle adjustment, super-resolution and clip loss. arXiv preprint arXiv:1812.03368, 2018.
[43] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, volume 2, page 7, 2017.
[44] Y. Zou, Z. Luo, and J.-B. Huang. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In European Conference on Computer Vision, 2018.


PackNet-SfM: 3D Packing for Self-Supervised Monocular Depth Estimation
Supplementary Material

1. Backbones Comparison

The proposed PackNet architecture introduces several key components that are shown to significantly improve monocular depth estimation performance over previous methods that rely on the ResNet [7] family of backbones for the encoder. Here we provide further evidence of this claim by comparing the performance of several standard ResNet variations for the task of self-supervised monocular depth estimation, under the same training conditions as the proposed architecture. From these results (summarized in Table 1) it is clear that PackNet achieves substantially better performance on all considered metrics, even when trained from scratch (the velocity supervision loss was not considered in these experiments).

2. Pose Evaluation
In Table 2 we show the results of the proposed PackNet-SfM method on the KITTI odometry benchmark [4]. To compare with related methods, we train our system from scratch on sequences 00-08 of the KITTI odometry benchmark, with exactly the same parameters as when training on the Eigen split for the KITTI depth benchmark. For consistency with related methods, we compute the Absolute Trajectory Error (ATE) averaged over all overlapping 5-frame snippets on sequences 09 and 10. Following [5], to evaluate our 3-frame model on 5-frame snippets we combine the relative transformations between the target frame and the first context frame into 5-frame-long overlapping trajectories, i.e. we stack f_x(I_t, I_{t−1}) = x_{t→t−1} to create appropriately sized trajectories.
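A sketch of this stacking procedure is given below, assuming each relative transformation is available as a 4×4 homogeneous matrix; the helper names are illustrative.

```python
import numpy as np

def accumulate_trajectory(relative_poses):
    """Chain relative transforms x_{t->t-1} into absolute poses (frame 0 origin)."""
    poses = [np.eye(4)]
    for T in relative_poses:
        poses.append(poses[-1] @ T)
    return poses

def five_frame_snippets(poses):
    """All overlapping 5-frame snippets, each re-expressed in its first frame."""
    snippets = []
    for i in range(len(poses) - 4):
        ref_inv = np.linalg.inv(poses[i])
        snippets.append([ref_inv @ p for p in poses[i:i + 5]])
    return snippets
```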

The ATE results are summarized in Table 2, with the proposed approach achieving competitive results. We compare with related methods, and additionally note that the methods trained in the monocular setting (M) are scaled at test-time using the ground-truth translation. Our method trained with the velocity supervision loss (M+v), on the other hand, does not require ground-truth scaling at test-time, as it is able to recover metrically accurate scale purely from monocular imagery.

In Figure 1, we show qualitative trajectory estimates of PackNet-SfM on the test sequences 09 and 10; similar plots are shown in Figure 2 for the training sequences 00-08. We generate the trajectory plots by evaluating our method on consecutive 3-frame snippets and stacking the transformations over the whole sequence, as described above. In Figure 1, we illustrate the variants of PackNet-SfM evaluated in Table 2, and note that PackNet-SfM trained with the velocity loss is able to recover metric scale consistently across the entire trajectory.

Backbone | Pretrained | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³
ResNet18 | | 0.145 | 1.457 | 5.842 | 0.226 | 0.820 | 0.937 | 0.973
ResNet18 | ✓ | 0.142 | 1.352 | 5.969 | 0.223 | 0.823 | 0.940 | 0.974
ResNet34 | | 0.140 | 1.547 | 5.979 | 0.222 | 0.832 | 0.941 | 0.974
ResNet34 | ✓ | 0.141 | 1.359 | 5.605 | 0.219 | 0.834 | 0.943 | 0.974
ResNet50 | | 0.141 | 1.208 | 5.668 | 0.221 | 0.828 | 0.939 | 0.974
ResNet50 | ✓ | 0.138 | 1.253 | 5.606 | 0.217 | 0.833 | 0.943 | 0.976
ResNet101 | | 0.139 | 1.199 | 5.614 | 0.220 | 0.829 | 0.941 | 0.975
ResNet101 | ✓ | 0.140 | 1.397 | 5.901 | 0.219 | 0.833 | 0.944 | 0.975
PackNet | | 0.120 | 1.018 | 5.136 | 0.198 | 0.865 | 0.955 | 0.980

Table 1: ResNet backbone comparisons. For the various ResNet backbones, we report monocular depth estimation results on the KITTI dataset [4] using the Eigen split [3], and how they compare to PackNet under the same training conditions. For completeness, we provide results for both training from random initialization and with ImageNet [2] pre-training.



Method | Supervision | Resolution | GT | Seq. 09 | Seq. 10
Godard et al. [5] | M | 640 x 192 | ✓ | 0.023 ± 0.013 | 0.018 ± 0.014
SfMLearner (Zhou et al. [11]) | M | 416 x 128 | ✓ | 0.021 ± 0.017 | 0.020 ± 0.015
DF-Net [12] | M | 576 x 160 | ✓ | 0.017 ± 0.007 | 0.015 ± 0.009
Klodt et al. [8] | M | 416 x 128 | ✓ | 0.014 ± 0.007 | 0.013 ± 0.009
Vid2Depth [9] | M | 416 x 128 | ✓ | 0.013 ± 0.010 | 0.012 ± 0.011
GeoNet (Yin et al. [10]) | M | 416 x 128 | ✓ | 0.012 ± 0.007 | 0.012 ± 0.009
Struct2Depth [1] | M | 416 x 128 | ✓ | 0.011 ± 0.006 | 0.011 ± 0.010
PackNet-SfM | M | 640 x 192 | ✓ | 0.009 ± 0.005 | 0.008 ± 0.007
PackNet-SfM | M+v | 640 x 192 | ✓ | 0.008 ± 0.005 | 0.008 ± 0.007
PackNet-SfM | M+v | 640 x 192 | | 0.012 ± 0.006 | 0.011 ± 0.007

Table 2: Average Absolute Trajectory Error (ATE) in meters on the KITTI odometry benchmark [4]. All methods are trained on sequences 00-08 and evaluated on sequences 09-10. The ATE numbers are averaged over all overlapping 5-frame snippets in the test sequences. M+v refers to velocity supervision (v) in addition to monocular images (M). The GT checkmark indicates the use of ground-truth translation to scale the estimates at test-time.

Figure 1: Pose evaluation on KITTI test sequences. Qualitative trajectory results of PackNet-SfM on test sequences 09 and 10 of the KITTI odometry benchmark (ground truth vs. PackNet M with GT scaling, PackNet M+v with GT scaling, and PackNet M+v without GT scaling).

3. Failure Cases

A well-known limitation of self-supervised learning from structure-from-motion cues is dealing with dynamic objects; in particular, objects moving at roughly the same speed as the camera tend to be predicted at infinite depth [1, 6]. This is because these objects show no apparent motion, so the re-projection error approaches zero if these scene points are placed infinitely far away. While Casser et al. [1] propose a mechanism to deal with dynamic scenes using instance masks of objects as privileged information, in this work we still rely on the static-world assumption to train and learn the scene depth. Thus, our proposed approach remains limited when addressing inconsistencies in re-projection errors due to scene dynamics, as depicted in Figure 3.

We hope to address these limitations in future work, in order to further increase performance.

Figure 3: Failure cases. Examples of objects with motion similar to the camera's being projected to infinite depth, a well-known limitation of monocular depth estimation from image sequences.

4. Supplementary Video

We also provide a video showing further qualitative results obtained by the proposed PackNet-SfM architecture¹. The depth and pose predictions of PackNet-SfM are used to reconstruct 3D point clouds of sequence 05 of the KITTI dataset. To highlight the temporal consistency of our estimates, we use a rolling window of 10 accumulated point clouds for plotting, and use the pose estimates to transform the point clouds into the same frame of reference. When plotting, we remove scene points that are further than 50 m from the camera or higher than the maximum LiDAR height (roughly 3 m).

¹ The video can be found at https://youtu.be/-N8QFtL3ees


Figure 2: Additional pose evaluation on KITTI training sequences. Qualitative trajectory results of PackNet-SfM on training sequences 00-08 of the KITTI odometry benchmark (ground truth vs. PackNet M with GT scaling, PackNet M+v with GT scaling, and PackNet M+v without GT scaling).

We show that the ground plane and nearby objects are reconstructed with high fidelity. PackNet-SfM is able to consistently capture fine details of the curbs, fences, painted signs on the street, and bricks throughout the run. Due to the smoothness of the depth estimates, boundary artifacts are still present in regions with strong discontinuities. Notably, the illustrated 3D reconstructions are metrically accurate, as they were trained with the proposed velocity-scaling loss. Finally, we note that each frame in the sequence is inferred independently, and no temporal consistency measures were used to produce the video.

References
[1] V. Casser, S. Pirk, R. Mahjourian, and A. Angelova. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. 2019.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[3] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
[4] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[5] C. Godard, O. Mac Aodha, and G. Brostow. Digging into self-supervised monocular depth estimation. arXiv preprint arXiv:1806.01260, 2018.
[6] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, volume 2, page 7, 2017.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[8] M. Klodt and A. Vedaldi. Supervising the new with the old: Learning SfM from SfM. In European Conference on Computer Vision, pages 713–728. Springer, 2018.
[9] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5667–5675, 2018.
[10] Z. Yin and J. Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2018.
[11] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, volume 2, page 7, 2017.
[12] Y. Zou, Z. Luo, and J.-B. Huang. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In European Conference on Computer Vision, 2018.