


EDIC RESEARCH PROPOSAL 1

4D Reconstruction: modelling the dynamic structure and appearance of a scene

Olivier Küng, CvLab, I&C, EPFL

Abstract—Adding a temporal dimension to Multi View Stereo (MVS) systems enables the capture of dynamic scenes. As changes both in appearance and structure occur over time, a framework is needed to jointly model these variations. In turn, this would allow for more accurate and coherent reconstructions and open doors for reconstruction from sparser temporal and spatial images, i.e. large internet photo collections. In addition to the thesis proposal, I present two papers about dense 4D reconstruction focusing on structural changes and provide insights into the main challenges in this field. A third paper, covering a more fundamental computer vision concept, feature space analysis, is also presented. Finally, a general framework which could handle dense and sparse 4D reconstruction is discussed and application examples are provided.

Index Terms—Multi view stereo, dynamic, 4D reconstruction.

I. INTRODUCTION

The concept of measuring geometric properties of objects from projected images largely pre-dates the advent of photography. Pioneers like Leonardo da Vinci worked as early as 1480 with perspective and central projection, manually measuring distances. Digital photography and computational power nowadays allow us to automatically measure a wide array of properties of different scenes, from motion capture to 3D representation

Proposal submitted to committee: September 7th, 2010; Candidacy exam date: September 13th, 2010; Candidacy exam committee: Roger Hersch, Pascal Fua, Sabine Süsstrunk.

This research plan has been approved:

Date: ————————————

Doctoral candidate: ————————————(name and signature)

Thesis director: ————————————(name and signature)

Thesis co-director: ————————————(if applicable) (name and signature)

Doct. prog. director:————————————(R. Urbanke) (signature)

EDIC-ru/05.05.2009

of cities. It is a very active research field, driven by the wide availability of affordable cameras and the advantages of passive, non-intrusive capture.

In a general setting, we are given as input a set of images taken from camera devices. Most techniques assume that these cameras are calibrated, i.e. that a model of their intrinsic geometry is known. This can be computed by capturing a known pattern in a single-camera setup [32], or by constraining correspondences in a multi-view setting and optimizing with bundle adjustment [24]. The captured scene may evolve through time both in structure and appearance, which are defined as:

Structure Geometrical properties (position, motion) of the scene, i.e. a time-evolving surface

Appearance Result of the illumination and reflectance properties of the structure as captured by the camera

As we can see, both terms are closely related. We can classify the existing techniques by the type of scene they are intended to model.

Static structure and appearance:

• Markov MVS. We distinguish MVS systems based on a Markov framework, as they have specific properties. There is a great number of algorithms for this task, with an up-to-date benchmark maintained by [22]. These techniques generally produce a dense 3D reconstruction by matching regions with similar appearance across images. A global optimization (GraphCut [3], Belief Propagation [27]) is often used with a prior to resolve ambiguities and provide regularization over the Markov model. The correspondence search is performed either in image space (depthmap approach) or Euclidean space (volumetric approach). This is particularly efficient for small-scale reconstruction.

• Feature MVS. An initial surface or cloud of points is obtained by matching features together. This limits the search space and can handle large scenes. Work by [19] creates an adaptive search space based on initial feature matches, which is then optimized. Work by [11] adopts an iterative refinement approach. With an efficient descriptor, [29] directly produces a dense reconstruction.

• Other MVS. Variational approaches [20] are often used to refine an initial mesh. In a very different setting, the photometric approach of [15] uses controlled lighting to capture a 3D scene.

• SLAM. Navigation-oriented methods also use feature matching to reconstruct a sparse [8] or dense [18] environment. However, they focus mainly on efficiency and dealing with large image collections. Even though appearance and structure change over time, these changes are discarded and the methods focus on the static part of the scene.

• Internet Photo Collection. Methods reconstructing cities from large collections of internet images [2], [23], [24], [26] often use feature MVS techniques for their ability to handle large scenes. Despite great changes in structure and appearance, these are also mostly discarded. The reconstruction then relies on clusters of images sharing little variation, which are often available when the image set is large enough.

Dynamic structure, static appearance:

• Markov MVS. Extensions of Markov MVS include structure dynamics. They tend to be computationally very expensive and are usually optimized only on a limited number of frames. Some rely on tracking features or on using optical flow or scene flow [17], [30].

• Feature MVS. Feature-based approaches are popular for dynamic sequences [19]. Many techniques are specialized for human motion in controlled environments. Some of these use priors such as deformable templates [9], or refine sequences of meshes [7].

• Other MVS. There are also extensions of the variational approaches [13].

• Internet Photo Collection. A very interesting work extends the large collections of internet images to include structural changes [21]. The goal of this work is to assign time stamps to a large set of skyscraper images. A reconstruction is computed and a probabilistic optimization is carried out on visibility constraints.

Static structure, dynamic appearance:

• Internet Photo Collection. There has not been much work on appearance change in the context of 3D reconstruction. One notable exception is the recent work of [12], which performs a PCA on a large set of reprojected images to obtain a basis for appearance.

Dynamic structure and appearance:

• We are not aware of any work jointly modelling dynamic structure and appearance.

A few comments: note that in MVS, it is assumed that the cameras are synchronized, thus capturing a static scene. Moreover, the assumption is that there is photometric consistency across views, which mostly holds for Lambertian materials. In a scene captured at a frame rate high enough relative to the motion, appearance mostly changes only gradually over time, allowing some sort of tracking over frames, especially using robust descriptors [29]. For large internet image collections, appearance is often discarded, assuming enough data to provide gradual appearance changes. Two recent works explore individually modelling structural or appearance changes for specific cases of large internet image collections.

This thesis proposal suggests the design of a framework for dynamic scenes which jointly models structural and appearance changes. This would enhance traditional dynamic MVS as well as open doors for new applications.

II. BACKGROUND PAPERS

In this section, two recent papers about the reconstruction of dynamic scenes are presented to give insights into the challenges in the field. Both papers focus on reconstructing the structure of a dynamic scene, discarding appearance change over time. The first one [31] takes advantage of physical properties of the scene to guide the reconstruction, and the second one [1] reasons more generally in terms of visibility constraints. Finally, a paper about feature space analysis [6] is also presented to provide more general background in computer vision.

III. PHYSICALLY GUIDED LIQUID SURFACE MODELLING FROM VIDEOS

The general context of this paper is that water animation for computer graphics, while giving realistic results through simulation, is hard to control. The idea is to capture a real water scene, which is intuitive to control in the real world, through a stereo system. Such a system does not capture the whole scene, and provides only sequences of partial 3D reconstructions, which are then completed by a physically based optimization.

An overview of the method is provided in Figure 1. It consists of two main stages, initialization and optimization. In the first stage, an initial shape sequence Ψ = {Ψt} is assembled by considering each time step t ∈ [0, T] individually. The second stage then produces a final 3D shape sequence Φ = {Φt} which is optimized to satisfy spatio-temporal coherence based on physical laws. More formally, it can be written as

E(Φ) = Σ_{t=0}^{T} ( E_d(Φ_t, Ψ_t) + E_s(Φ_t) ) + Σ_{t=0}^{T−1} E_n(Φ_t, Φ_{t+1})    (1)

where E_d measures the similarity between the reconstructed shape and the initial one, E_s ensures spatial smoothness of the output shape, and E_n ensures that the shape follows the spatio-temporal physical constraints. Some background material is given first, before explaining the steps of the method in more detail.
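The three-term structure of the energy can be illustrated with a toy numpy sketch. The concrete choices below (squared distance for E_d, squared spatial gradient for E_s, squared temporal difference for E_n) are hypothetical stand-ins, not the paper's actual terms; they only show how data fidelity, spatial smoothness and temporal coherence trade off:

```python
import numpy as np

# Toy evaluation of the energy in Eq. (1), with hypothetical choices for the
# three terms (not the paper's actual implementations):
#   E_d: squared distance between output and initial shape
#   E_s: squared spatial gradient (smoothness) of the output shape
#   E_n: squared temporal difference between consecutive shapes
def E_d(phi_t, psi_t):
    return np.sum((phi_t - psi_t) ** 2)

def E_s(phi_t):
    return np.sum(np.diff(phi_t) ** 2)

def E_n(phi_t, phi_next):
    return np.sum((phi_next - phi_t) ** 2)

def energy(Phi, Psi):
    T = len(Phi) - 1
    data = sum(E_d(Phi[t], Psi[t]) + E_s(Phi[t]) for t in range(T + 1))
    temporal = sum(E_n(Phi[t], Phi[t + 1]) for t in range(T))
    return data + temporal

# A smoothed 1D "shape" sequence scores lower than the raw noisy one.
rng = np.random.default_rng(0)
Psi = [np.sin(np.linspace(0, np.pi, 50)) + 0.1 * rng.standard_normal(50)
       for _ in range(5)]
noisy = [p.copy() for p in Psi]
smooth = [np.convolve(p, np.ones(5) / 5, mode="same") for p in Psi]
assert energy(smooth, Psi) < energy(noisy, Psi)
```

The smoothed sequence pays a small E_d penalty for deviating from the input but gains much more on E_s and E_n, which is exactly the trade-off the optimization stage exploits.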

Fig. 1. Overview of the algorithm. Images are given as input. A first stage generates a sequence of initial surfaces, which are refined in an optimization stage to produce a new sequence of shapes.

A. Background

1) Stereo: Stereo algorithms take as input a pair of images and produce a depthmap, i.e. an image with a discretized depth estimate at each pixel. The depths are estimated by creating a cost volume which measures the correspondence score for each hypothesized discrete depth. A global optimization scheme (BP, GraphCut) is then used to find a surface passing through this volume with minimal cost, adding a regularization term on the surface.
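The cost-volume idea can be sketched in a few lines for a rectified stereo pair, where each depth hypothesis corresponds to a fixed horizontal disparity. This toy version replaces the global optimizer (BP, GraphCut) with a simple winner-take-all pass and uses a synthetic image pair; it is an illustrative assumption, not the algorithm used in the paper:

```python
import numpy as np

# Minimal plane-sweep sketch: build a cost volume of per-pixel matching
# costs over disparity hypotheses, then take the cheapest hypothesis.
def disparity_map(left, right, max_disp):
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        # absolute intensity difference between left pixel x and right pixel x-d
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, :w - d])
    return np.argmin(cost, axis=0)   # winner-take-all instead of BP/GraphCut

# Synthetic pair: the right image is the left image shifted by 3 pixels.
left = np.tile(np.arange(32, dtype=float), (8, 1))
right = np.roll(left, -3, axis=1)
disp = disparity_map(left, right, max_disp=8)
```

Away from the image borders the recovered disparity is the true shift of 3 pixels; a real system would add the regularized global optimization described above to resolve ambiguous, textureless regions.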


In the algorithm used in this paper, further enhancements are obtained by post-processing the initial surface for subpixel depth accuracy, and a two-pass method detects occluded areas.

2) Water Simulation: There is a consensus among scientists that the Navier-Stokes equations are a very good model for fluid flow. However, the computational cost of solving this model accurately is very high, as it requires an iterative scheme with tiny steps. A stable model has been proposed by [25] which reproduces complex behaviours of fluids but largely increases the step size while remaining stable, thus reducing computing costs. The evolution of a fluid can be described with a pressure field p, a velocity field v and external forces f. In order to solve the Navier-Stokes equations in an unconditionally stable way, [25] proposes to split the equations into four steps. The fluid solver first applies the external forces on v_{t−1,t}. The next step accounts for the effect of advection of the fluid on itself; in other words, it accounts for the propagation of the fluid disturbance, which is non-linear. The third step damps the velocity field using viscosity diffusion. Finally, incompressibility is obtained by projecting the field into a divergence-free space via pressure projection. In other words, it ensures that the vector field behaves neither like a sink nor a source.
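The key to the unconditional stability of [25] is the semi-Lagrangian treatment of the advection step: each grid point traces backwards along the flow and samples the advected quantity there, so values are always interpolations of existing ones and cannot blow up. A minimal 1D sketch of just that step (the other three steps, forces, diffusion and projection, are analogous grid operations) could look as follows, assuming a constant velocity for simplicity:

```python
import numpy as np

# Semi-Lagrangian advection: backtrace each grid point by v*dt and
# interpolate the quantity q at the departure point. Stable even when
# dt * v / dx exceeds 1, where naive explicit upwinding would diverge.
def advect(q, v, dt, dx):
    n = len(q)
    x = np.arange(n) * dx
    x_back = x - v * dt              # departure points of the backtrace
    return np.interp(x_back, x, q)   # linear interpolation of q there

# A Gaussian bump advected by a constant velocity simply translates.
dx, dt, v = 1.0, 2.0, 1.5            # note: dt*v/dx = 3 > 1
q = np.exp(-0.5 * ((np.arange(64) - 20.0) / 3.0) ** 2)
q1 = advect(q, v, dt, dx)
peak = np.argmax(q1)                 # bump moved from cell 20 to cell 23
```

The bump lands three cells downstream after one step, with no stability restriction on dt; this is the property that lets [25] take the large steps mentioned above.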

3) Signed Distance Function: Signed distance functions (SDFs) are used for implicit surface representation. An SDF measures the distance to the surface boundary, positive outside and negative inside. It has the main advantage over an explicit mesh representation of being neutral to topology changes, and SDFs are more suited for 3D flow estimation as they are defined over the whole Euclidean space. A mesh representation can easily be extracted using the marching cubes algorithm.
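The neutrality to topology changes is easy to see on a small example: merging two implicit spheres is just a pointwise minimum of their SDFs, with no mesh surgery required. A minimal numpy sketch (a toy illustration, not the paper's representation):

```python
import numpy as np

# Signed distance of a sphere: negative inside, positive outside.
def sphere_sdf(grid, center, radius):
    return np.linalg.norm(grid - center, axis=-1) - radius

# Sample a 41^3 grid over [-2, 2]^3.
axis = np.linspace(-2, 2, 41)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)

a = sphere_sdf(grid, np.array([-0.5, 0.0, 0.0]), 0.8)
b = sphere_sdf(grid, np.array([+0.5, 0.0, 0.0]), 0.8)
union = np.minimum(a, b)     # implicit union: topology merge for free

center = union[20, 20, 20]   # point (0, 0, 0), inside both spheres
corner = union[0, 0, 0]      # point (-2, -2, -2), far outside
```

Here `center` is negative and `corner` positive, as the sign convention demands; a marching cubes pass over `union` would directly yield the merged surface mesh.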

Smoothing an SDF is done using the surface mean curvature κ and the Laplace-Beltrami operator ∆_B, under the target PDE ∆_B κ(x) = 0. By linearizing the mean curvature along the surface normal, a fourth-order, volume-preserving evolution is computed on the SDF representation of the shape:

Φ_s = ∆_B κ · ‖∇Φ‖    (2)

B. Method

1) Initialization: The input is a pair of high-speed (200 Hz) calibrated cameras capturing a fluid scene. As the stereo algorithm relies on texture, the water is opacified with ink and a video projector is used to generate texture on the surface. The stereo algorithm provides a 2.5D representation, a depthmap.

Simple heuristics about symmetry are used to convert thisdepthmap into an initial 3D surface sequence Ψ.

2) Optimization: The optimization stage refines the shape sequence Ψ both spatially and temporally. Before explaining the four steps of Figure 1, more insight is given into the spatial smoothing and temporal coherence.

The surface is smoothed spatially according to Equation (2), and additionally combined with a fidelity term on the initial shape to decrease Es and Ed.

Surface sequences are temporally optimized by using an estimated flow velocity to predict the surface shape from the neighbouring frame shapes Φ_{t−1} and Φ_{t+1}. An interpolation scheme is then used to refine the current shape with the two predicted ones. Flow estimation is inspired by the techniques used in fluid simulation which, given a shape sequence Φ and a vector field V_{t−1}, compute the resulting vector field V_t. Differently from fluid simulation, both Φ_t and Φ_{t+1} are assumed to be given, and the goal is to find the vector field V_{t−1,t}. The main idea is to approximate the four steps of water simulation and to use them to guide the optimization for temporal coherence. As shown in the diagram of Figure 2, the initial step is to find 3D correspondences between the shapes Φ_t and Φ_{t+1} to get v_{t,t+1}. This implicitly includes external forces and temporal continuity constraints. Then the vector field is smoothed, similarly to viscosity diffusion. Finally, incompressibility is enforced in the same way as in the water simulation, by pressure projection.

It can be noted that naively searching for correspondences between two shapes results in many ambiguities. The problem is therefore formulated as searching for correspondences with the same signed distance function value and the same local neighbourhood. This is done both forward and backward in time, which acts as velocity advection between velocity flows at different times.

However, in the initial sequence, there might be false positive and false negative regions, i.e. parts detected as surface which should not be, and parts which are missing. They have to be considered, as both result in missing correspondences in neighbouring frames. The idea is that water has some temporal coherence: if water has been detected in a region, it should be present in that region for a number of consecutive frames. An algorithm based on counters and mismatches in the correspondence search is used to identify false positives and negatives.

To summarize, the first step finds correspondences, which in turn allows false positives to be removed. Temporal coherence is then applied, interpolating predicted values using the velocity estimates of the first step. Finally, the remaining mismatches can be considered as false negatives, and missing water regions can be filled back in.

Fig. 2. Relation between the four steps of water simulation (top) and the approximate steps of this method.

C. Discussion

The paper's main contribution is a framework which guides the reconstruction process by taking fluid physics into account. It is important to note that the physics guides the reconstruction towards a visually pleasing result, not a physically accurate one. As the coherence is enforced in a refinement step, the method is not tied to a particular device, and more specialized systems such as [14] could be used to capture the initial shape.


Controlling fluid animation is a hard problem [28], and a data-driven approach has many useful applications. However, the authors do not discuss the scalability of their approach and show only two small-scale experiments captured at a very high frame rate. It is hard to measure the effect of approximating the original stable solver. The technique used to resolve ambiguities in the correspondence search may suffer from problems similar to those of optical flow for uniform motions. The authors also include an ad-hoc technique to correct false positive and false negative regions. From their examples, it works for accurate depthmaps, but there is no guarantee that it also works with noisier input.

In fluid simulation, there is a trade-off between computation and accuracy, usually in terms of step size and cell size. The smaller the steps relative to the cell size, the more accurate the simulation [25]. Similarly, one can question the output of such a technique on a larger dataset taken at a much lower frame rate. Another approach which could allow larger simulations would be to use Smoothed Particle Hydrodynamics instead of the implicit SDF representation.

Despite these shortcomings, this could be included in my own work about river simulation. The main difference is that the input is a noisy, low-frame-rate cloud of points. This captures only the surface; the bottom is unknown. However, preprocessing of the river data could provide a flow and a depth estimate.

IV. GLOBALLY OPTIMAL SPATIO-TEMPORAL RECONSTRUCTION FROM CLUTTERED VIDEOS

This work presents a dynamic Multi View Stereo system. The input is a set of images taken by synchronized, calibrated cameras capturing a dynamic scene under uncontrolled imaging conditions. The method is based on the following property: a surface is well approximated by a subset of the facets of a Delaunay triangulation (DT) of a cloud of points (with some additional constraints).

In 2D, this means that once a cloud of 2D points is obtained and a Delaunay triangulation computed, a subset of the edges of this triangulation is a good approximation of the curve we want to recover. This subset can be found by labelling the triangles of the DT into two sets, inside and outside. This labelling can be done efficiently by minimizing an energy over the graph using GraphCut. This principle has been successfully applied in 3D by [16], and this paper is the 4D extension of that method.

The main advantages observed in the 3D case are that a cloud of points can easily be obtained by feature matching, and that the Delaunay triangulation allows an adaptive representation of the scene, which is appropriate for large scenes. The GraphCut-based labelling allows a global optimization of visibility and photo-consistency with a smoothness prior.

This paper extends this framework to the dynamic case. The idea is to treat time as an additional dimension to space and construct a 4D representation. The 4D cloud of points is then the sequence of 3D clouds of points obtained individually at each frame. A Delaunay triangulation yields pentatopes, the 4D equivalent of tetrahedra. The goal is to find a hypersurface which passes through a subset of the facets of these pentatopes. This is again obtained by labelling the pentatopes as inside and outside, using a GraphCut minimization over an energy based on visibility and smoothness. The optimal hypersurface can then be sliced at a given time to get a 3D surface of the scene.

A. Background

Fig. 3. (a) The Voronoi diagram of a set of points. (b) The Delaunay triangulation, the dual of the Voronoi diagram.

1) Delaunay triangulation: The Voronoi diagram of a set of points E = {p1, ..., pn} ⊂ R^4 is the partition of space induced by the Voronoi cells. Such a cell is the region associated with a point, i.e. the region closer to this point than to any other point of E, as can be seen in Fig. 3 (a).

The Delaunay triangulation is defined as the geometric dual of the Voronoi diagram. An edge is drawn between two points of the Delaunay triangulation if there is a non-empty intersection between their associated Voronoi cells. This yields a triangulation of E, made of triangles in 2D, tetrahedra in 3D and pentatopes in 4D. The main property of the DT is that the circumcircles of the triangles in 2D are empty. This property holds in any dimension and can be used to construct the triangulation iteratively.
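The empty-circumcircle property can be tested with the classical in-circle determinant predicate, which is exactly what incremental DT construction relies on. A small 2D sketch (the 2D case of the property stated above; the triangle and query points are arbitrary examples):

```python
import numpy as np

# In-circle test: for a counter-clockwise triangle (a, b, c), point d lies
# strictly inside the circumcircle iff this 3x3 determinant is positive.
def in_circumcircle(a, b, c, d):
    rows = []
    for p in (a, b, c):
        px, py = p[0] - d[0], p[1] - d[1]
        rows.append([px, py, px * px + py * py])
    return np.linalg.det(np.array(rows)) > 0

# CCW right triangle whose circumcircle is centered at (0.5, 0.5).
a, b, c = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)
inside = in_circumcircle(a, b, c, (0.5, 0.5))    # circumcenter: inside
outside = in_circumcircle(a, b, c, (2.0, 2.0))   # far away: outside
```

A triangle of the DT is valid exactly when this predicate is false for every other input point; an incremental construction flips edges whenever an insertion makes it true.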

2) Graph Cut: Consider a finite directed graph G = (V, E) with two special vertices, the source s and the sink t. Informally, a cut C = (S, T) partitions the graph into two subsets such that s ∈ S and t ∈ T. The cost of the cut is the sum of the capacities of all edges going from S to T. Finding the minimum cut is a well-known problem which is equivalent to computing the maximum flow from the source to the sink. This can be seen as a binary labelling of the nodes, and very efficient algorithms exist.
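The max-flow/min-cut equivalence can be demonstrated with a compact Edmonds-Karp implementation (BFS augmenting paths). This is a generic textbook algorithm on a toy graph, not the specialized solver a real MVS system would use:

```python
from collections import deque

# Edmonds-Karp max flow on an adjacency matrix of capacities. By the
# max-flow/min-cut theorem, the returned value equals the cost of the
# minimum s-t cut, i.e. the cost of the optimal binary labelling.
def max_flow(cap, s, t):
    n = len(cap)
    residual = [row[:] for row in cap]
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:          # no augmenting path: flow is maximal
            return flow
        # bottleneck capacity along the path, then push flow along it
        bottleneck, v = float("inf"), t
        while v != s:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck

# s = 0, t = 3; two disjoint paths of capacity 3 and 2, so min cut = 5.
cap = [[0, 3, 2, 0],
       [0, 0, 0, 3],
       [0, 0, 0, 2],
       [0, 0, 0, 0]]
```

After the flow saturates, nodes reachable from s in the residual graph form the source side S of the cut; this reachability test is how the inside/outside labelling is read off in practice.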

B. Method

The method consists of three main steps. First, features are matched to create individual 3D clouds of points; the clouds are merged into a 4D cloud and triangulated using a DT to obtain a 4D mesh. The next step is to construct a graph over the DT. Finally, a minimal 4D surface is found by minimizing an energy over the graph. Optionally, a 3D mesh at a given time can be extracted.

1) Clouds of points and Delaunay Triangulation: Given multiple video sequences, DoG and Harris feature points are extracted from every image. These feature points are matched across images for each frame independently, using epipolar constraints. The main idea is to extract a very high number of features, which may lead to multiple matches. These matches are pruned by computing an NCC similarity cost with a threshold. Finally, the best m = 1, 2 candidates are kept and added to the cloud of points of their corresponding frame. The importance of keeping multiple candidates comes from the fact that a hypersurface built out of a subset of the facets of the DT pentatopes is a good approximation of the real hypersurface only if the original cloud of points is dense enough.

The individual clouds of points are merged into a single 4D cloud by considering time as an additional dimension. However, computing metrics in this augmented space is not well defined. Homogeneity of the space is obtained by introducing a global scale factor between the spatial and temporal dimensions, which can be interpreted as a reference displacement per unit of time. A 4D DT is then computed on this cloud of points.
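The space-time embedding amounts to appending a scaled time coordinate to each 3D point, so that the ordinary 4D Euclidean metric becomes meaningful. In the sketch below, `lam` is a hypothetical name for the paper's global scale factor:

```python
import numpy as np

# A 3D point observed at time t becomes the 4D point (x, y, z, lam * t),
# where lam converts time units into spatial units before taking the
# Euclidean 4D Delaunay triangulation.
def embed_4d(points_3d, t, lam):
    n = len(points_3d)
    return np.hstack([points_3d, np.full((n, 1), lam * t)])

# The same static 3D point, one frame apart, sits at distance lam in 4D.
p0 = embed_4d(np.array([[1.0, 0.0, 0.0]]), t=0, lam=2.0)
p1 = embed_4d(np.array([[1.0, 0.0, 0.0]]), t=1, lam=2.0)
dist = np.linalg.norm(p1 - p0)
```

Choosing `lam` thus decides how far a point may move per frame before its temporal neighbour stops being its spatial neighbour in the 4D triangulation.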

2) Graph Construction: The goal is now to find the correct labelling of the pentatopes of the 4D mesh into inside and outside sets. This separates the 4D mesh, and the facets of this separation form the desired 4D hypersurface S. In 4D, the facets of the pentatopes are tetrahedra with 4D coordinates.

The first step is to construct a graph of neighbouring pentatopes, obtained by creating an edge between each pair of pentatopes which share a 2D or 3D face. To be able to apply the GraphCut algorithm, a link to the source and sink is added to each vertex. The optimal labelling minimizes the energy defined for the surface S

E(S) = Ev(S) + Es(S) (3)

where Ev is a visibility cost and Es a smoothness prior.

The idea of the visibility term in 3D is that, if a point is part of the surface, it should be visible from all the cameras from which it has been triangulated, i.e. no other facet should lie in between. The exact visibility cost is the number of conflicts between all the camera-to-point rays. However, this involves too many inter-node interactions and cannot be solved by GraphCut. Instead, an approximation is computed as the number of intersections with the oriented surface. In 4D, the rays between points and cameras lie in the same temporal plane, so intersections of rays with the 4D pentatopes can be computed by projecting the pentatopes onto that plane. If the projected cell contains the camera, it should be labelled as outside. If it is behind the origin point, it should be marked as inside, and if a cell is crossed from inside to outside, it should be penalized. This is represented in Figure 4.

The smoothness cost is the area of S. This enforces smoothness in both space and time.

C. Discussion

The main contribution of this paper is the extension of a successful large-scale 3D reconstruction algorithm to dynamic scenes. It retains the main advantages of that algorithm: an adaptive and memory-efficient representation of the search space, a global optimization with a surface smoothness prior dealing with visibility constraints, and the ability to reconstruct any type of scene without a specific prior or initialization.

Fig. 4. p0 projects onto the camera center, so its weight to s is set to infinity. The ray from q1 to p1 crosses from inside to outside, so it is penalized. q2 is behind the origin, so its weight to t is set to infinity.

Adding a temporal dimension enables points to propagate their position information through time, as the pentatopes are defined across multiple frames. This representation also allows the surface to be regularized over space and time. Minimizing the surface temporally finds the shortest linear interpolation between two frames. As a result, the mesh is both spatially and temporally coherent. Moreover, it inherently produces an interpolated result between frames.

However, compared to the static version, the photo-consistency term has been removed from the optimization. As stated earlier, points do propagate their position, but no photometric consistency check is performed. The authors do not state the reason explicitly, but the running time of the algorithm may indicate that it is computationally prohibitive. This leads to the main problem of this technique: scalability. The experiments are done on 60- and 40-frame sequences (i.e. a few seconds) and require over 3 hours of processing. The most costly part is the Delaunay triangulation, which does not scale linearly with the input. A possibility would be to run the algorithm on a sliding, overlapping temporal window.

Another large-scale reconstruction method, which qualitatively performs better on the same test sequence, is based on the Patch-based Multi-view Stereo algorithm [19]. Instead of solving a global problem, it uses heuristics across space to expand and filter a refined mesh in a few iterations. For the temporal dimension, a static mesh is first computed and its vertices are then tracked over frames. The range of possible dynamic scene reconstructions is more limited, but on the other hand it scales well in time.

V. MEAN SHIFT

In their work, Comaniciu and Meer [6] propose a general nonparametric technique for the analysis and clustering of a multimodal feature space. It takes its roots in the feature-space-based image analysis paradigm. The main insight of this approach is that finding and grouping significant features of an image in a global way provides a robust method for analysing the image. A pdf can be estimated in this feature space; significant features are the local maxima, or modes, of this pdf. The authors present a method based on the mean shift algorithm for finding these significant features and delineating clusters around them, and provide two applications, image smoothing and segmentation.


A. Background

1) Feature Space: A feature space is a mapping of the input, and its nature is application dependent. This mapping is usually done in three steps: process the input in small subsets, choose a parametric representation of each subset, and project it as a point in the feature space. A typical example of such a mapping is a sliding window computing an n-bin quantized histogram and projecting a point into the n-dimensional feature space. The main advantage of this approach is that all evidence for the presence of a significant feature is pooled together globally, which makes it robust to noise. Significant features correspond to high point density in the feature space. On the other hand, the choice of the parametric representation is not trivial and needs to be made carefully for a given application so that significant features truly correspond to salient regions of the image.
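The histogram mapping described above can be sketched as follows; the window size and bin count are illustrative choices, not values from the paper:

```python
import numpy as np

def histogram_feature_space(image, win=8, bins=16):
    """Map each win x win window of a grey-level image to an n-bin
    normalized histogram, i.e. one point in an n-dimensional feature
    space (the sliding-window mapping described in the text)."""
    h, w = image.shape
    points = []
    for y in range(0, h - win + 1, win):
        for x in range(0, w - win + 1, win):
            patch = image[y:y + win, x:x + win]
            hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
            points.append(hist / hist.sum())  # normalize to a distribution
    return np.array(points)  # shape: (number_of_windows, bins)
```

Dense regions of the resulting point cloud then correspond to textures that occur frequently in the image.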

2) Kernel Density Estimation: Once this feature space is computed, the goal is to find the significant features and to delineate clusters around them. An important note is that the structure of the feature space is arbitrary, so parametric clustering algorithms (for example Gaussian mixtures) will perform poorly. Among nonparametric clustering algorithms, there are two main families: hierarchical approaches and pdf-based approaches. The first is based on a topological graph structure, is often computationally expensive, and depends on a stopping criterion which is not easy to establish in our context. The second relies on the assumption that the points of the feature space are sampled from an underlying nonparametric distribution. The empirical pdf f is obtained through kernel density estimation, also called Parzen estimation. This approximates the true pdf by estimating the local density of points, which is done by convolving the points with a multivariate kernel function K. Given n data points x_i ∈ R^d,

f(x) = \frac{1}{n} \sum_{i=1}^{n} K(x - x_i) \qquad (4)

where the kernel K is radially symmetric. The kernel K is expressed in terms of a piecewise continuous, monotonically nonincreasing and positive 1D profile function k:

K(x) = c_{k,d} \, k\!\left( \left\| \frac{x}{h} \right\|^2 \right) \qquad (5)

with a normalization constant c_{k,d} which makes K(x) integrate to one. The parameter h is the bandwidth, which defines the radius of the kernel. An example of such a profile is k_N(x) = \exp(-\frac{1}{2}x), which yields the multivariate normal kernel. The profile is symmetrically truncated to have finite support.
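Equations (4) and (5) with the normal profile k_N can be sketched as follows; this is a minimal illustration assuming a fixed bandwidth h, not an optimized implementation:

```python
import numpy as np

def kde(x, data, h=1.0):
    """Parzen estimate f(x) = (1/n) sum_i K(x - x_i), eq. (4), using the
    multivariate normal kernel K(x) = c * exp(-||x/h||^2 / 2), eq. (5)."""
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    c = (2 * np.pi) ** (-d / 2) / h ** d       # normalization constant c_{k,d}
    sq = np.sum(((x - data) / h) ** 2, axis=1)  # ||(x - x_i)/h||^2 per point
    return c * np.exp(-0.5 * sq).mean()
```

The estimate is high near dense regions of the data and decays towards zero far from them, which is exactly the behaviour the mode-finding procedure exploits.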

3) Density Gradient Estimation: As previously stated, significant features are represented by dense regions in the feature space and correspond to the modes of the underlying distribution. These local maxima are the stationary points of the gradient of this distribution. The key idea is that the empirical pdf does not have to be estimated explicitly, as we are only interested in its gradient. The gradient of a superposition of kernels centred at each data point is equivalent to convolving the data points with the gradient of the kernel:

\nabla f(x) = \frac{1}{n} \sum_{i=1}^{n} \nabla K(x - x_i) = \frac{c}{n} \sum_{i=1}^{n} \nabla k_i \qquad (6)

By defining a shadow profile g as the negative derivative of the original profile,

g(x) = -k'(x) \qquad (7)

and after a bit of calculus, we obtain

\nabla f(x) = \frac{c}{n} \underbrace{\left[ \sum_{i=1}^{n} g\!\left( \left\| \frac{x - x_i}{h} \right\|^2 \right) \right]}_{\text{term 1}} \underbrace{\left[ \frac{\sum_{i=1}^{n} x_i \, g\!\left( \left\| \frac{x - x_i}{h} \right\|^2 \right)}{\sum_{i=1}^{n} g\!\left( \left\| \frac{x - x_i}{h} \right\|^2 \right)} - x \right]}_{\text{term 2}} \qquad (8)

The first term is proportional to the density estimate at x computed with the shadow profile. The second term is the mean shift vector m(x), i.e., the difference between the weighted mean and x, the center of the kernel. Note that it points towards the direction of maximum increase in density.

4) Mean Shift Procedure: The mean shift procedure for a given point x_i is as follows:

1) Compute the mean shift vector m(x_i^t).
2) Translate the density estimation window: x_i^{t+1} = x_i^t + m(x_i^t).
3) Iterate 1) and 2) until convergence, i.e., \nabla f(x_i) = 0.
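The procedure above can be sketched with a flat (uniform) kernel, for which g(.) reduces to an indicator of the window; the function name, stopping tolerance, and iteration cap are illustrative assumptions:

```python
import numpy as np

def mean_shift(x0, data, h=1.0, tol=1e-6, max_iter=500):
    """Mean shift procedure with a flat kernel of radius h: iteratively
    translate the window to the mean of the data points it contains,
    until the shift vector m(x) becomes negligible (steps 1-3 above)."""
    x = np.asarray(x0, dtype=float)
    data = np.asarray(data, dtype=float)
    for _ in range(max_iter):
        # for the flat kernel, g(.) selects the points inside the window
        inside = np.sum((data - x) ** 2, axis=1) <= h ** 2
        if not inside.any():
            break
        m = data[inside].mean(axis=0) - x  # mean shift vector m(x)
        x = x + m                          # translate the window
        if np.linalg.norm(m) < tol:        # converged at a mode
            break
    return x
```

Run from any starting point, the iterate climbs the density estimate and stops at the nearest mode, which is how each pixel is later assigned to a basin of attraction.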

B. Example and convergence

An application example of the presented framework is image segmentation. The feature space is a 3D space consisting of the joint spatial and range parameters of the image, i.e., pixel position and grey intensity. For each pixel, the mean shift procedure is applied: a kernel in the 3D feature space is computed and iteratively shifted towards its mean. When the shift step gets too small, the procedure is stopped and the current position is marked as the mode of the starting pixel. The feature space of an image area is pictured in figure 5, along with the mean shift paths for the pixels on the plateau and on the line. Once the corresponding mode is found for each pixel, the feature space is delineated by first grouping all the modes which are close to each other, and then assigning a different label to each basin of attraction.
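The joint spatial-range mapping can be sketched as follows; this is a minimal illustration, and in practice separate spatial and range bandwidths scale the axes so that the dimensions are comparable:

```python
import numpy as np

def joint_feature_space(image):
    """Map a grey-level image to the 3D joint spatial-range feature space
    used for mean shift segmentation: one point (x, y, intensity) per
    pixel. Scaling by per-axis bandwidths is deliberately omitted here."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]  # pixel coordinate grids
    return np.stack([xs.ravel(), ys.ravel(), image.ravel()],
                    axis=1).astype(float)
```

Running the mean shift procedure on each row of this array and grouping nearby modes yields the segmentation labels described above.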

Fig. 5. Visualization of the mean shift paths. Black dots are the points of convergence.

One of the main contributions of this paper is to demonstrate that the mean shift does converge for discrete data. Intuitively, the authors first demonstrate that the sequence of pdf estimates


monotonically increases. Then, by using the property that the profile is convex, convergence is proved. With these two results in hand, some algebra shows that the mean shift sequence is a Cauchy sequence, and thus it always converges.

The authors also show that the mean shift follows a smooth trajectory. In this way, it contrasts with steepest ascent, which tends to zigzag. It also shares properties with M-estimators, which are techniques robust to outliers.

C. Discussion

Using the mean shift procedure for feature space analysis is very elegant. It allows finding the modes of the underlying nonparametric distribution and delineating clusters around them. It has only one parameter, the bandwidth, which is in a way related to the trade-off between overfitting and generalizing to the data.

In further work, [10] made the link between the Newton-Raphson method and the mean shift procedure with a piecewise constant kernel. Later, it was shown that the mean shift procedure with a Gaussian kernel is an EM algorithm [4]. An algorithm which takes advantage of the EM properties has been proposed for image segmentation, showing an order of magnitude of speed-up [5].

A remaining open question is the choice of feature space. Projecting into a feature space of high dimensionality, where the points are too spread out, breaks the assumption underlying the empirical pdf. Moreover, to get good results, the different dimensions should be comparable and scaled accordingly. This part still requires intuition for specific applications.

VI. RESEARCH PROPOSAL: 4D RECONSTRUCTION: MODELLING THE DYNAMIC STRUCTURE AND APPEARANCE OF A SCENE

A. Current work

With a set of calibrated videos capturing a river as input, the goal was to obtain a dynamic reconstruction. A frame is displayed in figure 6. Running a traditional static MVS algorithm based on a Markov model, the result was not satisfactory. The idea was then to take advantage of temporal coherence to improve the reconstruction, by extending the MVS algorithm with a temporal dimension. However, this extension came with a heavy computational cost, as the number of states to optimize exploded with the number of frames. The limitations of the temporal prior in this setting were such that the final reconstruction, despite being much slower than the static version, did not bring significant improvements.

We then switched to a static feature-based MVS. The output for each frame is a cloud of points. The idea was to fit a surface to the cloud of points in a regression scheme, by imposing 1st and 2nd order smoothness priors, both spatially and temporally. The computation scales well and the obtained results were both coherent and visually pleasing, as can be seen in figure 6; however, there is no guarantee of correctness and the method is not robust to outliers.

Fig. 6. A frame of the captured river sequence and the refined reconstruction

B. Future work

Another way to classify the methods proposed in the introduction is by capture setting. We identified four kinds of camera setups, listed below.

• Structure from Stills: The cameras are distributed sparsely over space and capture the scene at the same time. This groups the MVS techniques and is top left in fig 7.
• Structure from Video: One or multiple moving video cameras record a scene by moving densely in space. Time is discarded, as it is assumed that the scene is static. This groups the SLAM techniques and is top middle in fig 7.
• Dynamic Structure from Video: A fixed array of synchronized video cameras records a dynamic scene. The structure can be recovered at each frame, and may be tracked over time if the appearance is fixed or changes gradually. This groups the dynamic MVS systems and is bottom middle in fig 7.
• Dynamic Structure from Stills: Cameras are distributed over space and time. This groups the techniques based on collections of internet images and is bottom left in fig 7.

Fig. 7. Horizontal axis: spatial density of the cameras. Vertical axis: amount of structural changes in the scene.

Let us consider three data sets: a river sequence captured by an array of synchronized cameras, a construction site captured by video cameras every week, and finally a collection of internet images of Yosemite park.

1) River Sequence: Fluid coherence and surface processing: The river sequence fits in the third category. However, it is a special case, as the appearance varies tremendously between frames. A better representation would be top right in fig 7. This means that a shape estimate can be recovered at each frame independently. A model for the structural change can range from one as simple as the one we are currently using to a more advanced one, such as the one presented in the fluid paper [31]. The specific nature of a river's surface appearance requires specific


Fig. 8. Three images of the Nevada Fall from the Yosemite collection, reprojected along a virtual camera.

modelling, such as segmentation into specific regions (rocks, foam, floating objects) and water flow tracking.

2) Construction Site: Dynamic structure and appearance model: The construction site can be seen as a sequence of structure from video, top right in fig 7. As with the river sequence, this means that the structure can be recovered at each sparse time frame. The challenge in this case is to find a suitable model to regularize the structure over time. Appearance could also be modelled to allow rendering at a different time step with the same appearance, in the spirit of [12]. One could imagine having a final result with two sliders, one for appearance and one for structure, which can be moved independently.

3) Collection of internet images: Clustering to model the dynamics: The Yosemite sequence is a collection of internet images. As the pictures have been taken at different locations and time instances, implying structural changes in some parts of the scene, it fits the fourth category. Current methods based on large collections of images discard the structural and appearance changes, which results in coherent reconstructions only at static places where the appearance is similar enough. To illustrate this, figure 8 shows three images which have been automatically registered to the Yosemite 3D scene. We can notice that they differ greatly in appearance depending on the season, time of day and camera used. The river structure varies greatly in these three pictures, which is not taken into account by current algorithms. With this as a starting point, the idea is to find a clustering of the images, for example bottom right in fig 7, which would allow modelling the structural and appearance changes.

REFERENCES

[1] E. Aganj, J.-P. Pons, and R. Keriven. Globally optimal spatio-temporal reconstruction from cluttered videos. Computer Vision - ACCV 2009, 2010.
[2] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building Rome in a day. 2009 IEEE 12th International Conference on Computer Vision, pages 72-79, Sept. 2009.
[3] Y. Boykov and O. Veksler. Graph cuts in vision and graphics: Theories and applications. Handbook of Mathematical Models in Computer Vision, 1:100-119, 2006.
[4] M. A. Carreira-Perpinan. Gaussian mean-shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):767-76, May 2007.
[5] M. Carreira-Perpinan. Acceleration strategies for Gaussian mean-shift image segmentation. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), Volume 1, pages 1160-1167, 2006.
[6] D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603-619, May 2002.
[7] J. Courchay, J.-P. Pons, P. Monasse, and R. Keriven. Dense and accurate spatio-temporal multi-view stereovision. Computer Vision - ACCV 2009, pages 1-12, 2010.
[8] A. Davison. SLAM with a single camera. Workshop on Concurrent Mapping and Localization for Autonomous Mobile Robots, 2002.
[9] E. de Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H.-P. Seidel, and S. Thrun. Performance capture from sparse multi-view video. ACM Transactions on Graphics, 27(3), Aug. 2008.
[10] M. Fashing and C. Tomasi. Mean shift is a bound optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):471-4, Mar. 2005.
[11] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362-76, Aug. 2010.
[12] R. Garg, S. M. Seitz, and N. Snavely. The dimensionality of scene appearance. 2009 IEEE 12th International Conference on Computer Vision, pages 1917-1924, Sept. 2009.
[13] B. Goldluecke and M. Magnor. Space-time isosurface evolution for temporally coherent 3D reconstruction. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), pages 350-355, 2004.
[14] T. Hawkins, P. Einarsson, and P. Debevec. Acquisition of time-varying participating media. ACM Transactions on Graphics, 24(3):812, July 2005.
[15] C. Hernandez Esteban, G. Vogiatzis, and R. Cipolla. Multiview photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3):548-54, Mar. 2008.
[16] P. Labatut, J.-P. Pons, and R. Keriven. Efficient multi-view reconstruction of large-scale scenes using interest points, Delaunay triangulation and graph cuts. 2007 IEEE 11th International Conference on Computer Vision, pages 1-8, Oct. 2007.
[17] E. S. Larsen, P. Mordohai, M. Pollefeys, and H. Fuchs. Temporally consistent reconstruction from multiple video streams using enhanced belief propagation. 2007.
[18] R. Newcombe and A. Davison. Live dense reconstruction with a single moving camera. CVPR, 2010.
[19] Y. Furukawa and J. Ponce. Dense 3D motion capture from synchronized video streams. CVPR, June 2008.
[20] J.-P. Pons, R. Keriven, and O. Faugeras. Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score. International Journal of Computer Vision, 72(2):179-193, July 2006.
[21] G. Schindler and F. Dellaert. Probabilistic temporal inference on reconstructed 3D scenes. 2010.
[22] S. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), Volume 1, pages 519-528, 2006.
[23] N. Snavely, R. Garg, S. M. Seitz, and R. Szeliski. Finding paths through the world's photos. ACM Transactions on Graphics, 27(3), Aug. 2008.
[24] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the world from internet photo collections. International Journal of Computer Vision, 80(2):189-210, Dec. 2007.
[25] J. Stam. Stable fluids. Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 1999.
[26] C. Strecha, T. Pylvanainen, and P. Fua. Dynamic and scalable large scale image reconstruction. Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 406-413. IEEE, 2010.
[27] J. Sun, H. Shum, and N. Zheng. Stereo matching using belief propagation. Computer Vision - ECCV 2002, pages 450-452, 2002.
[28] N. Thurey, R. Keiser, M. Pauly, and U. Rude. Detail-preserving fluid control. Graphical Models, 71(6):221-228, Nov. 2009.
[29] E. Tola, V. Lepetit, and P. Fua. DAISY: an efficient dense descriptor applied to wide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):815-30, May 2010.
[30] T. Tung, S. Nobuhara, and T. Matsuyama. Complete multi-view reconstruction of dynamic scenes from probabilistic fusion of narrow and wide baseline stereo. ICCV, pages 1709-1716, 2009.
[31] H. Wang, M. Liao, Q. Zhang, R. Yang, and G. Turk. Physically guided liquid surface modeling from videos. ACM Transactions on Graphics, 28(3), 2009.
[32] Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330-1334, 2000.