Source: srl.mcgill.ca/publications/2008-ICPR.pdf

Depth-based Image Mosaicing for Both Static and Dynamic Scenes

Qi Zhi, Centre for Intelligent Machines, McGill University, [email protected]

Jeremy R. Cooperstock, Centre for Intelligent Machines, McGill University, [email protected]

Abstract

Traditional image-based mosaicing deals with the problem of parallax by imposing constraints of a parallax-free camera configuration or requiring a dense sampling of the scene. These solutions are often impractical or fail to address the needs of the application. Instead, taking advantage of depth cues and a criterion of smooth transitions, we achieve significantly improved mosaicing results for static scenes, coping effectively with non-trivial parallax in the input. Furthermore, by incorporating a criterion of consistent motion perception, we demonstrate progress on mosaicing of dynamic scenes without introducing artifacts. Although further additions are required to cope with unconstrained object motion, our algorithm can synthesize perceptually convincing dynamic mosaics, conveying the same appearance of object motion as seen in the original sequences.

1. Introduction

Traditional image mosaicing techniques operate by first aligning the inputs and then warping and stitching them together. However, in the presence of considerable disparity variance or dynamic objects in the scene, image registration is frustrated by parallax effects and object motion. This results in misalignments, which lead to artifacts in both static and dynamic mosaics.

In order to cope with the parallax problem, most algorithms impose constraints of either a parallax-free camera configuration [3][5][12] or a dense sampling of the scene, such as that provided by manifold mosaics [9][14]. However, these requirements may be impractical or prohibitively expensive to satisfy.

The application of mosaicing to dynamic video sequences was first proposed by Irani [7] and Sawhney [11]. Both approaches were based on the assumption of a parallax-free input video, provided by a single rotating camera, which cannot capture dynamic events continuously in both the spatial and temporal dimensions.

Rav-Acha et al. [1] described the production of non-chronological mosaic video. Although effective for its purpose, this unfortunately does not preserve the perception of chronologically continuous motion for objects in the scene.

To overcome the aforementioned problems and limitations, we introduce a novel image mosaicing algorithm, which considers 2D image mosaicing as a depth-based view synthesis problem. When integrated with foreground-background segmentation and motion perception analysis, our algorithm can generate reasonable dynamic mosaics when given inputs exhibiting non-trivial parallax.

The remainder of this paper is organized as follows. Section 2 describes our depth-based image mosaicing approach, on which the dynamic mosaic algorithm, described in Section 3, relies. Section 4 provides a comparison of experimental results with those of Autostitch, and Section 5 concludes with a discussion of desirable improvements.

2. Depth-Based Image Mosaicing for a Static Scene

In contrast with traditional mosaicing techniques, we synthesize the panorama as if seen through a virtual camera with a wider field-of-view (FOV) than the inputs. This assumes the availability of a depth map of the entire scene, a requirement we discuss in further detail below. Provided that the depth estimates are reasonable, we may build a panorama free of parallax-related artifacts. While traditional view synthesis techniques only consider content that is viewed by multiple input cameras, i.e., contained in overlapping regions of the input where stereo information is available, our approach employs a depth propagation procedure to include the contents of non-overlapping regions, visible only to a single camera, as well.

2.1 Synthesis of overlapping regions

Mosaicing in the overlapping regions, R_o, is performed using the plane sweep algorithm [6]. Given a selected position and orientation of the virtual camera, the inputs are projected onto parallel planes located at different depths to generate a set of intermediate images. Each intermediate image contains RGB color channels and an associated matching-score channel that denotes the similarity between the projections from different input images.

Let P be the set of pixels in the output mosaic, and L be the set of depth levels. We apply the graph cut [2] algorithm to estimate color and depth information for every pixel in the overlapping regions of the mosaic image by minimizing the following equation:

E(f) = E_data + E_smoothness (1)

where the labelling f : P → L assigns each pixel of the synthesized frame a discrete depth level, E_data represents the color consistency between the various sources, and E_smoothness indicates the smoothness of the transition between estimated depths of neighboring pixels in the virtual image.

2.2 Synthesis of non-overlapping regions

Because of the lack of stereo correspondence information, mosaic pixels of the non-overlapping regions, R_non, must be calculated by a different method from that applied to R_o. We observe that depth discontinuities rarely occur in regions of uniform texture but typically coincide with color segment boundaries. Taking advantage of this fact, we propagate the (reliable) depth information of color segments from R_o to adjacent color segments in R_non, provided that this results in the appearance of a smooth transition between them.

Synthesis of the mosaic in R_non is a procedure that maps color segments of the output mosaic to their correspondences in the intermediate images {I_d1, ..., I_dN} built during synthesis of the mosaic in R_o.

Let S denote the set of color segments {s1, s2, ..., sM} in R_non and L be the set of depth levels {d1, ..., dN}. Based on the assumption of uniform depth within each segment, a greedy algorithm is used to calculate the best labelling ρ : S → L that minimizes the energy function:

E(ρ) = E_smoothness + E_occlusion (2)

The first term, E_smoothness, evaluates the overall connection cost between neighboring color segments as follows:

E_smoothness = Σ_{i=1..M} Σ_{(p,q) ∈ Ψ_i} C_i(p, q) (3)

where Ψ_i represents the border areas between color segment s_i and its neighbors, and the inner sum, with respect to one color segment s_i ∈ S, is the total smooth connection cost [8] over all pairs of neighboring pixels (p, q) within Ψ_i. The second term, E_occlusion, accounts for occlusion by applying a constant penalty, λ_occ, for each occluded s_i.

Starting with the subset of segments S_M1 in R_non immediately adjacent to R_o, all depth candidates are tested for each s_i ∈ S_M1, noting the best candidate. The update of depth values is deferred until the end of the iteration, and is applied only if the overall energy of the entire group improves over that of the current labelling ρ_M1 = {ρ(s_i), s_i ∈ S_M1}. This process is repeated until either the change in total connection cost between iterations is insignificant or the number of iterations exceeds a threshold. The process is then applied to the immediate neighbors of S_M1 for which depth estimates have not yet been computed, and continues until no unprocessed color segments remain. Once depth estimates are obtained, the mosaic in R_non is rendered by copying the corresponding color segments from the intermediate images into the mosaicing image plane.
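The outward greedy propagation described above can be sketched as follows. The `connection_cost` callback, the adjacency structure, and the simplified update rule (one best candidate per frontier segment, applied at the end of each pass) are our own stand-ins; the occlusion penalty and the energy-improvement check of Eq. (2) are omitted for brevity.

```python
def propagate_depths(neighbors, depth_levels, known, connection_cost):
    """Greedily propagate depth labels from segments with known depth
    (those adjacent to R_o) outward through R_non.

    neighbors       : dict, segment -> list of adjacent segments
    depth_levels    : iterable of candidate depth labels
    known           : dict, segment -> depth, for segments bordering R_o
    connection_cost : cost of placing segment s at depth d next to an
                      already-labelled neighbor t at depth e
    """
    labels = dict(known)
    frontier = {t for s in labels for t in neighbors[s] if t not in labels}
    while frontier:
        best = {}
        for s in frontier:
            # test every depth candidate for s, noting the best one
            best[s] = min(
                depth_levels,
                key=lambda d: sum(connection_cost(s, d, t, labels[t])
                                  for t in neighbors[s] if t in labels))
        # updates are deferred to the end of the pass, as in the paper
        labels.update(best)
        frontier = {t for s in best for t in neighbors[s] if t not in labels}
    return labels
```

With a toy chain of segments and an absolute-difference connection cost, the known depth simply flows outward until every segment is labelled.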

3. Depth-based Dynamic Mosaics

We now turn to the problem of generating a perceptually correct mosaic that includes moving objects in the scene. Unlike earlier dynamic mosaicing approaches [7][11][1], which employ a single rotating video camera to acquire the scene content, we use a multiple-camera configuration with a large baseline. The resulting parallax effects pose a significant challenge to the mosaicing task.

Our approach first performs a segmentation of foreground and background layers using a Mixture-of-Gaussians (MoG) model [13], which offers robustness to potentially complex illumination conditions, such as non-uniform lighting and dynamic shadows. It then projects these layers separately onto the mosaicing plane, according to their respective depth estimates, to render the final result.

3.1 Foreground-background segmentation

Suppose the intensity of each pixel in the frames follows an MoG distribution, with K = 3 Gaussian components in our present implementation. For a given frame, the probability that each pixel belongs to the background is calculated according to the trained models. If this probability exceeds some threshold, the pixel is considered an element of the static background; otherwise, it is treated as part of a dynamic foreground object.

In addition to the binary foreground-background segmentation result, the background image, a weighted sum of the means of each Gaussian component of the MoG, is also constructed. This image retains only the static portions of the scene, while dynamic foreground objects are removed. Two such background images are pictured in Figure 1a-b, along with the depth-based mosaic (DBM) based on these, seen in Figure 1d. A multi-band blending strategy [4] is applied to balance the color differences between camera responses, with the result shown in Figure 1e.

Figure 1. Construction of the mosaic background
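The per-pixel MoG classification and background-image construction can be sketched as follows. This is a minimal grayscale sketch in the spirit of Stauffer-Grimson [13]: parameter values and function names are ours, and component re-initialisation is omitted, so it is not the paper's exact model.

```python
import numpy as np

K = 3  # Gaussian components per pixel, as in the paper

def update_mog(means, variances, weights, frame, lr=0.05, match_sigma=2.5):
    """One online update of a per-pixel Mixture of Gaussians.
    Only the best-matching component (within match_sigma std. devs.)
    is adapted toward the new frame."""
    dist = np.abs(frame[..., None] - means)              # (H, W, K)
    matched = dist < match_sigma * np.sqrt(variances)
    best = dist.argmin(axis=-1, keepdims=True)
    hit = (matched & (np.arange(K) == best)).astype(float)
    weights += lr * (hit - weights)
    weights /= weights.sum(axis=-1, keepdims=True)
    rho = lr * hit
    means += rho * (frame[..., None] - means)
    variances += rho * ((frame[..., None] - means) ** 2 - variances)

def background_image(means, weights):
    """Background = weighted sum of the component means (Section 3.1)."""
    return (weights * means).sum(axis=-1)

def is_foreground(frame, means, variances, weights,
                  thresh=0.5, match_sigma=2.5):
    """A pixel is background if its matched components carry enough
    weight; otherwise it is treated as dynamic foreground."""
    dist = np.abs(frame[..., None] - means)
    matched = dist < match_sigma * np.sqrt(variances)
    return (weights * matched).sum(axis=-1) < thresh
```

Given a model trained on a static scene of intensity 100, a pixel that suddenly reads 200 is flagged as foreground, while the background image reproduces the static intensity.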

3.2 Rendering of dynamic mosaic

The foreground mosaic in R_o, as observed by the virtual mosaicing camera, is synthesized in the same manner as in the static case, described in Section 2.1. However, for R_non, where no stereo information is available to estimate depth values, a different process is necessary.

People perceive motion as a sequence of instances [10], which are sensitive to characteristics of the motion trajectories as they appear in the 2D video sequences, specifically velocity and its first derivative. Furthermore, 2D trajectories with the same sequence of instances are perceived as corresponding to the same motion, regardless of the poses of the corresponding video cameras.

As in the case of static mosaicing, where we require the appearance of a smooth transition between contents from different sources, dynamic mosaicing requires a consistent perception of motion when objects move across the FOVs of different cameras. Thus, in R_non, if the output mosaic video observed by the virtual mosaicing camera preserves the same velocities as the input video, and hence the same sequence of instances, it conveys an identical perception of motion, as well as the spatiotemporal motion consistency present in the input video.

These principles motivate our approach to the construction of mosaic video. Construction of the t'th frame of foreground content in R_non projects the foreground layer onto the mosaicing image plane according to its depth estimate, d(t), in such a manner that the motion trajectory in the output mosaic preserves the same velocity as that in the input. Finally, a simple merging of the background mosaic, as in the example of Figure 1e, with the foreground mosaic video results in the final dynamic mosaic.
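The velocity-preserving placement can be sketched as follows. This is a hypothetical simplification: the function name is ours, and the per-frame mapping between input-camera and mosaic-plane pixel scales at depth d(t) is collapsed into a single constant `scale` factor.

```python
import numpy as np

def extend_trajectory(mosaic_anchor, input_positions, scale=1.0):
    """Extrapolate a foreground object's mosaic-plane trajectory in
    R_non so that its per-frame velocity matches the velocity observed
    in the input video, mapped by `scale` (assumed constant here).

    mosaic_anchor   : (x, y) position where the object last appeared
                      in the overlapping region R_o of the mosaic
    input_positions : per-frame (x, y) positions in the input camera
    """
    pos = [np.asarray(mosaic_anchor, dtype=float)]
    inp = [np.asarray(p, dtype=float) for p in input_positions]
    for t in range(1, len(inp)):
        velocity = inp[t] - inp[t - 1]   # per-frame velocity in the input
        pos.append(pos[-1] + scale * velocity)
    return np.array(pos)
```

For example, input steps of 1, 1 and 2 pixels with scale 2 produce mosaic steps of 2, 2 and 4 pixels from the anchor, so the sequence of instances (relative velocities) is preserved.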

4. Experimental Results

We are unaware of any comparable mosaicing technique that has proven capable of addressing non-trivial parallax effects from sparse sampling in either static or dynamic scenes. As a well-known representative of conventional mosaicing techniques, we use Autostitch as the baseline against which to evaluate the quality of our algorithms.

To obtain an alignment ensuring the best matching performance in R_o between sources, Autostitch must deform the input images, compressing objects closer to the cameras and expanding distant objects to equalize their respective disparities. The effects of this process can be seen in Figure 2(I-e). In contrast, our depth-based image mosaicing results, shown in Figure 2(I-f), are free of such deformations. Furthermore, our mosaics do not suffer from the unpleasant ghosting effects seen in the Autostitch result, indicated by the highlighted bounding boxes in Figure 2(I-e), which stem from misalignments due to parallax. Note that these artifacts are evident even after applying deformations to compensate for disparity variance.

For further comparison, a reference DBM result generated using the ground-truth depth values from the teddy data set is shown in Figure 2(II-a). The overlapping regions, i.e., within the black boundaries of Figure 2(II-b), exhibit reasonable coherence with the reference mosaic, although appearance differences due to depth estimate variance are observed in the non-overlapping regions, in particular toward the right of the mosaic result.

It bears comment that, unlike view synthesis algorithms, our DBM approach does not attempt to determine the true depth in non-overlapping regions. Indeed, with the naive assumption of uniform depth for each color segment in R_non, the DBM method generates depth estimates that usually do not conform to the ground-truth topology of most scenes. Nevertheless, the smooth appearance connection criterion guarantees a resemblance between local regions in the mosaic results and those of the inputs. As such, the mosaicing outputs based on these depth estimates still appear reasonable and perceptually acceptable.

For video inputs containing moving foreground objects, Autostitch, like other traditional image-based mosaicing algorithms, generates results in which consecutive frames may exhibit jitter, as seen in the difference frame in Figure 2. Furthermore, the ghost errors in frame 70, resulting from parallax, are a significant issue even for a single mosaic image. By comparison, our depth-based dynamic mosaicing approach produces results that are free of parallax-related artifacts.

Figure 2. Comparison of our depth-based mosaicing algorithm to Autostitch.

5. Conclusion and Future Work

Traditional algorithms are prevented from generating perceptually acceptable panoramas under fairly common conditions. In response, we introduced techniques that treat image mosaicing as a view synthesis problem that must exploit depth information. We demonstrated the use of a smooth motion perception criterion, which guarantees not only the appearance of correct motion but also motion consistency in both the spatial and temporal dimensions. Our algorithms are applicable to both static and dynamic scenes.

The results presented here are, of course, only a start. Considerable work remains, in particular to cope with arbitrary movement of multiple objects in the scene. Furthermore, dynamic mosaicing at video rates requires taking advantage of the parallel computation abilities of a GPU, or exploiting the efficient depth map generation of pre-calibrated stereo cameras or laser rangefinders.

References

[1] A. Rav-Acha, Y. Pritch, D. Lischinski, and S. Peleg. Dynamosaicing: Mosaicing of dynamic scenes. IEEE Trans. PAMI, pages 1789–1801, 2007.

[2] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. PAMI, pages 1124–1137, 2004.

[3] M. Brown and D. Lowe. Recognising panoramas. In Int. Conf. Comput. Vision (ICCV), pages 1218–1225, 2003.

[4] P. Burt and E. Adelson. A multiresolution spline with application to image mosaics. ACM Transactions on Graphics, pages 217–236, 1983.

[5] S. Chen. QuickTime VR: an image-based approach to virtual environment navigation. In SIGGRAPH, pages 29–38. ACM, 1995.

[6] R. Collins. A space-sweep approach to true multi-image matching. IEEE Trans. PAMI, pages 358–363, 1996.

[7] M. Irani, P. Anandan, and S. Hsu. Mosaic based representations of video sequences and their applications. In Int. Conf. Comput. Vision (ICCV), pages 605–611, 1995.

[8] V. Kwatra, A. Schodl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: Image and video synthesis using graph cuts. ACM Transactions on Graphics (SIGGRAPH 2003), 22:277–286, July 2003.

[9] S. Peleg and J. Herman. Panoramic mosaics by manifold projection. In CVPR, pages 338–343, 1997.

[10] C. Rao, A. Yilmaz, and M. Shah. View-invariant representation and recognition of actions. Int. J. Comput. Vision, 50(2):203–226, 2002.

[11] H. Sawhney and S. Ayer. Compact representations of videos through dominant and multiple motion estimation. IEEE Trans. PAMI, pages 814–830, 1996.

[12] H. Shum and R. Szeliski. Construction and refinement of panoramic mosaics with global and local alignment. In Int. Conf. Comput. Vision (ICCV), pages 953–960, 1998.

[13] C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In CVPR '99, pages 246–252, 1999.

[14] A. Zomet, D. Feldman, S. Peleg, and D. Weinshall. Mosaicing new views: The crossed-slits projection. IEEE Trans. PAMI, pages 741–754, 2003.