International Journal of Computer Vision 59(3), 207–232, 2004. © 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

Visual Modeling with a Hand-Held Camera

MARC POLLEFEYS
Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599-3175, USA

[email protected]

LUC VAN GOOL, MAARTEN VERGAUWEN, FRANK VERBIEST, KURT CORNELIS AND JAN TOPS
Center for Processing of Speech and Images, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10,

B-3001 Leuven, Belgium
[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

REINHARD KOCH
Institut für Informatik und Praktische Mathematik, Christian-Albrechts-Universität Kiel,

Hermann-Rodewald-Str. 3, D-24098 Kiel, Germany
[email protected]

Received March 14, 2001; Revised October 8, 2003; Accepted October 10, 2003

Abstract. In this paper a complete system to build visual models from camera images is presented. The system can deal with uncalibrated image sequences acquired with a hand-held camera. Based on tracked or matched features the relations between multiple views are computed. From this both the structure of the scene and the motion of the camera are retrieved. The ambiguity on the reconstruction is restricted from projective to metric through self-calibration. A flexible multi-view stereo matching scheme is used to obtain a dense estimation of the surface geometry. From the computed data different types of visual models are constructed. Besides the traditional geometry- and image-based approaches, a combined approach with view-dependent geometry and texture is presented. As an application, fusion of real and virtual scenes is also shown.

Keywords: visual modeling, structure-from-motion, projective reconstruction, self-calibration, multi-view stereo matching, dense reconstruction, 3D reconstruction, image-based rendering, augmented video, hand-held camera

1. Introduction

During recent years a lot of effort was put into developing new approaches for modeling and rendering visual scenes. A few years ago the main applications of 3D modeling in vision were robot navigation and visual inspection. Nowadays, however, the emphasis has changed. There is more and more demand for 3D models in computer graphics, virtual reality and communication. This results in a change in the requirements: the visual quality of the models becomes the main point of attention. There is an important demand for simple and flexible acquisition procedures. Therefore calibration should be absent or restricted to a minimum.

Many new applications also require robust low-cost acquisition systems. This stimulates the use of consumer photo or video cameras.

In this paper we present an approach that can be used to obtain several types of visual models from images acquired with an uncalibrated camera. The user acquires the images by freely moving the camera around an object or scene. Neither the camera motion nor the camera settings have to be known a priori. The presented approach can generate a textured 3D surface model or alternatively render new views using a combined geometry- and image-based approach that uses view-dependent texture and geometry. The system can also be used to combine virtual objects with real video, yielding augmented video sequences.

Other approaches for extracting 3D shape and texture from image sequences acquired with a freely moving camera have been proposed. The approach of Tomasi and Kanade (1992) used an affine factorization method to extract 3D from image sequences. An important restriction of this system is the assumption of orthographic projection. Another type of approach starts from an approximate 3D model and camera poses and refines the model based on images (e.g. Facade, proposed by Debevec et al. (1996)). The advantage is that fewer images are required. On the other hand a preliminary model must be available, which mostly limits the approach to man-made environments. This approach also combines geometry- and image-based techniques, however only the texture is view-dependent.

The approach presented here combines many ideas and algorithms that have been developed in recent years. This paper aims at consolidating these results by bringing them together and showing how they can be combined to yield a complete visual modeling approach. In Section 1.1 notations and background are given. The rest of the paper is then organized as follows. Section 2 discusses feature extraction and matching and the computation of the multi-view relations. Section 3 deals with the structure and motion recovery, including self-calibration. In Section 4 the approach to obtain dense depth maps is presented and in Section 5 the construction of the different visual models is discussed. The paper is concluded in Section 6.

1.1. Notations and Background

In this section we briefly introduce some of the geometric concepts used throughout the paper. A more in-depth description can be found in Faugeras et al. (2001) and Hartley and Zisserman (2000). A perspective camera is modeled through the projection equation

m ∼ PM (1)

where ∼ represents equality up to a non-zero scale factor, M is a 4-vector that represents a 3D world point in homogeneous coordinates, m is similarly a 3-vector that represents a corresponding 2D image point and P is a 3 × 4 projection matrix. In a metric or Euclidean frame P can be factorized as follows

$$P = K R^\top [\, I \mid -t\,] \quad \text{where} \quad K = \begin{bmatrix} f & s & u \\ 0 & rf & v \\ 0 & 0 & 1 \end{bmatrix} \qquad (2)$$

contains the intrinsic camera parameters, R is a rotation matrix representing the orientation and t is a 3-vector representing the position of the camera. The intrinsic camera parameter f represents the focal length measured in width of pixels, r is the aspect ratio of pixels, (u, v) are the coordinates of the principal point and s is a term accounting for the skew. In general s can be assumed to be zero. In practice, the principal point is often close to the center of the image and the aspect ratio r close to one. In many cases the camera does not perfectly satisfy the perspective projection model and distortions have to be taken into account, the most important being radial distortion. In practice, it is often sufficient to model the radial distortion as follows:

$$m \sim P(M) = K\,R\!\left(R^\top [\, I \mid -t\,] M\right) \quad \text{with} \quad R(x) = \left(1 + K_1\left(x^2 + y^2\right)\right)[x\ \ y\ \ 0]^\top + [0\ \ 0\ \ 1]^\top \qquad (3)$$

where K1 indicates the amount of radial distortion that is present in the image. For high accuracy applications more advanced models can be used (Slama, 1980; Willson, 1994).
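To make the projection model of Eqs. (1)–(3) concrete, here is a minimal sketch in Python/NumPy (the numerical values are illustrative only and do not come from the paper); it assumes the point is first normalized to the plane z = 1 before the radial distortion of Eq. (3) is applied.

```python
import numpy as np

def project(M, K, R, t, K1=0.0):
    """Project a homogeneous 3D point M following m ~ K R(R^T [I | -t] M),
    with the one-parameter radial distortion model of Eq. (3)."""
    ext = R.T @ np.hstack([np.eye(3), -t.reshape(3, 1)])  # 3x4 extrinsic part
    x = ext @ M
    x = x / x[2]                                          # normalize so that z = 1
    r2 = x[0] ** 2 + x[1] ** 2
    xd = (1.0 + K1 * r2) * np.array([x[0], x[1], 0.0]) + np.array([0.0, 0.0, 1.0])
    m = K @ xd
    return m / m[2]                                       # homogeneous pixel coordinates

# Illustrative intrinsics (f in pixels, aspect ratio r, skew s, principal point (u, v)).
f, r, s, u, v = 1000.0, 1.0, 0.0, 320.0, 240.0
K = np.array([[f, s, u], [0.0, r * f, v], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, -2.0])              # camera orientation and position
print(project(np.array([0.1, 0.2, 1.0, 1.0]), K, R, t, K1=-0.1))
```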

In this paper the notation d(·, ·) will be used to indicate the Euclidean distance between entities in the images.

Two View Geometry. The point m′ corresponding to the point m in another image is bound to lie on the projection of its line of sight, l′ ∼ Fm, where F is the fundamental matrix for the two views under consideration. Therefore, the following equation should be satisfied for all corresponding points:

$$m'^{\top} F m = 0. \qquad (4)$$

The fundamental matrix has rank 2 and the right and left null-spaces of F correspond to the epipoles. The epipoles e and e′ are the projections of the projection center of one image in the other image. The fundamental matrix can be obtained from two projection matrices P and P′ as

$$F = (P'^{\top})^{\dagger} P^{\top} [e]_{\times} \qquad (5)$$

where the epipole e = PC′ with C′ the solution of P′C′ = 0.

Homographies. These can be used to transfer image points that correspond to 3D points on a specific plane from one image to the other, i.e. m′ ∼ Hm where H is the homography that corresponds to that plane (for the two views under consideration). There is an important relationship between such homographies and the fundamental matrix:

$$F \sim [e']_{\times} H \quad \text{and} \quad H = [e']_{\times} F - e' a^{\top} \qquad (6)$$

with [e′]× an antisymmetric matrix representing the vector product with the epipole and with a a vector related to the plane. Homographies for a plane L = [a b c d]⊤ can also be obtained from projection matrices as

$$H_{L\,ii'} = H_{L\,i'} H_{L\,i}^{-1} \quad \text{with} \quad H_{L\,i'} = P_{i'} \begin{bmatrix} d\,I_{3\times3} \\ -[a\ \ b\ \ c] \end{bmatrix} \qquad (7)$$

From 3 points M1, M2 and M3 a plane is obtained as the right null space of [M1 M2 M3]⊤.

Comparing Image Regions. Image regions are typically compared using sum-of-square-differences (SSD) or zero-mean normalized cross-correlation (ZNCC). Consider a window W in image I and a corresponding region T(W) in image J. The dissimilarity between two image regions based on SSD is given by

$$D = \iint_W \left[ J(T(x, y)) - I(x, y) \right]^2 w(x, y)\, dx\, dy \qquad (8)$$

where w(x, y) is a weighting function defined over W. Typically, w(x, y) = 1 or it is a Gaussian. The similarity measure between two image regions based on ZNCC is given by

$$S = \frac{\iint_W \left(J(T(x, y)) - \bar J\right)\left(I(x, y) - \bar I\right) w(x, y)\, dx\, dy}{\sqrt{\iint_W \left(J(T(x, y)) - \bar J\right)^2 w(x, y)\, dx\, dy}\ \sqrt{\iint_W \left(I(x, y) - \bar I\right)^2 w(x, y)\, dx\, dy}} \qquad (9)$$

with $\bar J = \iint_W J(T(x, y))\, dx\, dy$ and $\bar I = \iint_W I(x, y)\, dx\, dy$ the mean image intensities over the considered region. Note that this last measure is invariant to global intensity and contrast changes over the considered regions.

2. Relating Images

Starting from a collection of images or a video sequence the first step consists in relating the different images to each other. This is not an easy problem. A restricted number of corresponding points is sufficient to determine the geometric relationship or multi-view constraints between the images. Since not all points are equally suited for matching or tracking (e.g. a pixel in a homogeneous region), the first step consists of selecting a number of interesting points or feature points. Some approaches also use other features, such as lines or curves, but these will not be discussed here. Depending on the type of image data (i.e. video or still pictures) the feature points are tracked or matched and a number of potential correspondences are obtained. From these the multi-view constraints can be computed. However, since the correspondence problem is an ill-posed problem, the set of corresponding points can be contaminated with an important number of wrong matches or outliers. In this case, a traditional least-squares approach will fail and therefore a robust method is needed. Once the multi-view constraints have been obtained they can be used to guide the search for additional correspondences. These can then be used to further refine the results for the multi-view constraints.

2.1. Feature Extraction and Matching

One of the most important requirements for a feature point is that it can be differentiated from its neighboring image points. If this were not the case, it wouldn't be possible to match it uniquely with a corresponding point in another image. Therefore, the neighborhood of a feature should be sufficiently different from the neighborhoods obtained after a small displacement.

A second order approximation of the dissimilarity, as defined in Eq. (8), between an image window W and a slightly translated image window is given by

$$d(\Delta x, \Delta y) = \begin{bmatrix}\Delta x & \Delta y\end{bmatrix} M \begin{bmatrix}\Delta x \\ \Delta y\end{bmatrix} \quad \text{with} \quad M = \iint_W \begin{bmatrix}\frac{\partial I}{\partial x} \\[2pt] \frac{\partial I}{\partial y}\end{bmatrix} \begin{bmatrix}\frac{\partial I}{\partial x} & \frac{\partial I}{\partial y}\end{bmatrix} w(x, y)\, dx\, dy \qquad (10)$$

To ensure that no displacement exists for which D is small, the eigenvalues of M should both be large. This can be achieved by enforcing a minimal value for the smallest eigenvalue (Shi and Tomasi, 1994) or alternatively for the following corner response function R = det M − k(trace M)² (Harris and Stephens, 1988), where k is a parameter set to 0.04 (a suggestion of Harris). In the case of tracking this is sufficient to ensure that features can be tracked from one video frame to the next. In this case it is natural to use the tracking neighborhood to evaluate the quality of a feature (e.g. a 7 × 7 window with w(x, y) = 1). Tracking itself is done by minimizing Eq. (8) over the parameters of T. For small steps a translation is sufficient for T. To evaluate the accumulated difference from the start of the track it is advised to use an affine motion model.
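The sketch below (a simplified illustration, not the authors' implementation) evaluates the structure matrix M of Eq. (10) on a 7 × 7 window with w(x, y) = 1 and reports both the smallest-eigenvalue criterion of Shi and Tomasi and the Harris response det M − k(trace M)².

```python
import numpy as np

def corner_responses(I, x, y, half=3, k=0.04):
    """Structure matrix of Eq. (10) over a (2*half+1)^2 window with w = 1,
    returning (smallest eigenvalue, Harris response)."""
    Iy, Ix = np.gradient(I.astype(float))      # image derivatives (rows, then columns)
    win = np.s_[y - half:y + half + 1, x - half:x + half + 1]
    gx, gy = Ix[win].ravel(), Iy[win].ravel()
    M = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(M)[0], np.linalg.det(M) - k * np.trace(M) ** 2

# A synthetic image with a bright square: a corner scores high, a flat region low.
I = np.zeros((50, 50))
I[20:35, 20:35] = 1.0
print(corner_responses(I, 20, 20))   # near a corner of the square
print(corner_responses(I, 5, 5))     # homogeneous region
```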

In the case of separate frames, as obtained with a still camera, there is the additional requirement that as many image points originating from the same 3D points as possible should be extracted. Therefore, only local maxima of the corner response function are considered as suitable features. Sub-pixel precision can be achieved through quadratic approximation of the neighborhood of the local maxima. A typical choice for w(x, y) in this case is a Gaussian with σ = 0.7. Matching is typically done by comparing small, e.g. 7 × 7, windows centered around the feature through SSD or ZNCC. This measure is only invariant to image translations and can therefore not cope with too large variations in camera pose.

To match images that are more widely separated, it is required to cope with a larger set of image variations. Exhaustive search over all possible variations is computationally intractable. A more interesting approach consists of extracting a more complex feature that not only determines the position, but also the other unknowns of a local similarity (Schmid and Mohr, 1997) or affine transformation (Lowe, 1999; Tuytelaars and Van Gool, 2000).

2.2. Two View Geometry Computation

Even for an arbitrary geometric structure, the projections of points in two views contain some structure. Recovering this structure is not only interesting to retrieve information on the relative pose between the two views (Faugeras, 1992; Hartley et al., 1992), but also to eliminate mismatches and to facilitate the search for additional matches. This structure corresponds to the epipolar geometry and is mathematically expressed by the fundamental matrix. Given a number of corresponding points, Eq. (4) can be used to compute F. This equation can be rewritten in the following form:

$$[\,x x'\ \ y x'\ \ x'\ \ x y'\ \ y y'\ \ y'\ \ x\ \ y\ \ 1\,]\,\mathbf{f} = 0 \qquad (11)$$

with m = [x y 1]⊤, m′ = [x′ y′ 1]⊤ and f a vector containing the elements of the fundamental matrix. Stacking 8 or more of these equations allows the fundamental matrix to be solved for linearly. Even for 7 corresponding points the one-parameter family of solutions obtained by solving the linear equations can be restricted to 1 or 3 solutions by enforcing the cubic rank-2 constraint det(F1 + λF2) = 0. As pointed out by Hartley (1997) it is important to normalize the image coordinates before solving the linear equations. Otherwise the columns of Eq. (11) would differ by several orders of magnitude and the error would concentrate on the coefficients corresponding to the smaller columns. This normalization can be achieved by transforming the image center to the origin and scaling the images so that the coordinates have a standard deviation of unity. More advanced approaches have been proposed (e.g. Matei and Meer, 2000), but in practice the simple approach is sufficient to initialize a non-linear minimization. The result of the linear equations can be refined by minimizing the following criterion (Zhang et al., 1995):

$$C(F) = \sum \left( d(m', Fm)^2 + d(m, F^{\top} m')^2 \right) \qquad (12)$$

This criterion can be minimized through a Levenberg-Marquardt algorithm (Press et al., 1992). An even better approach consists of computing a maximum-likelihood estimation (MLE) by minimizing the following criterion:

$$C(F, \hat m, \hat m') = \sum \left( d(m, \hat m)^2 + d(m', \hat m')^2 \right) \quad \text{with} \quad \hat m'^{\top} F \hat m = 0 \qquad (13)$$

Although in this case the minimization has to be carried out over a much larger set of variables, this can be achieved efficiently by taking advantage of the sparsity of the problem (see Section 3.3).
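The linear part of the computation can be sketched as a normalized 8-point algorithm (Hartley's normalization, the null vector of the stacked Eq. (11), and rank-2 enforcement); the 7-point minimal variant and the non-linear refinement of Eqs. (12)–(13) are omitted.

```python
import numpy as np

def normalize(pts):
    """Translate the points to zero mean and scale each axis to unit standard deviation."""
    mean, std = pts.mean(axis=0), pts.std(axis=0)
    T = np.array([[1 / std[0], 0, -mean[0] / std[0]],
                  [0, 1 / std[1], -mean[1] / std[1]],
                  [0, 0, 1]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def fundamental_8pt(m, m_prime):
    """Linear estimate of F from n >= 8 correspondences m <-> m' (pixel coordinates,
    shape (n, 2)), followed by enforcement of the rank-2 constraint."""
    mn, T = normalize(m)
    mpn, Tp = normalize(m_prime)
    x, y, xp, yp = mn[:, 0], mn[:, 1], mpn[:, 0], mpn[:, 1]
    A = np.column_stack([x * xp, y * xp, xp, x * yp, y * yp, yp, x, y, np.ones(len(x))])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)     # null vector of the stacked Eq. (11)
    U, s, Vt = np.linalg.svd(F)
    F = U @ np.diag([s[0], s[1], 0.0]) @ Vt       # force rank 2
    F = Tp.T @ F @ T                              # undo the normalization
    return F / F[2, 2]
```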

To compute the fundamental matrix from a set of matches that were automatically obtained from a pair of real images, it is important to explicitly deal with outliers. If the set of matches is contaminated with even a small set of outliers, the result of the above method can become unusable. This is typical for all types of least-squares approaches (even non-linear ones). The problem is that the quadratic penalty (which is optimal for Gaussian noise) allows a single outlier that is very far away from the true solution to completely bias the final result (Torr, 1995).

The approach that is used to cope with this problem is the RANSAC algorithm proposed by Fischler and Bolles (1981). A minimal subset of the data is randomly selected and the solution obtained from it is used to segment the remainder of the dataset into "inliers" and "outliers". If the initial subset contains no outliers, it can be expected that an important number of inliers will support the solution; otherwise the initial subset is probably contaminated with outliers. This procedure is repeated until a satisfying solution is obtained. This can for example be defined as a probability in excess of 95% that a good subsample was selected. The expression for this probability is Γ = 1 − (1 − γ^p)^m with γ the fraction of inliers, p the number of features in each sample and m the number of trials (see Rousseeuw (1987)).

Once the epipolar geometry has been computed it can be used to guide the matching procedure toward additional matches. At this point only features that are close to the epipolar line should be considered for matching. Table 1 summarizes the robust approach to the determination of the two-view geometry.
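The robust loop of Table 1 can be sketched generically as follows; estimate_model and residuals are placeholders (e.g. the linear F estimator above and a point-to-epipolar-line distance), and the stopping rule uses Γ = 1 − (1 − γ^p)^m to decide how many samples are still needed.

```python
import numpy as np

def ransac(data, estimate_model, residuals, sample_size, threshold,
           confidence=0.95, max_trials=10000, seed=0):
    """Generic RANSAC: draw minimal samples until the probability Gamma of having
    seen at least one outlier-free sample exceeds `confidence`."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.zeros(len(data), dtype=bool)
    trials, needed = 0, max_trials
    while trials < min(needed, max_trials):
        sample = data[rng.choice(len(data), sample_size, replace=False)]
        model = estimate_model(sample)
        inliers = residuals(model, data) < threshold
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = model, inliers
            p_good = inliers.mean() ** sample_size       # gamma^p for the current estimate
            if p_good > 0:                               # solve Gamma >= confidence for m
                needed = int(np.ceil(np.log(1 - confidence) / np.log1p(-p_good)))
        trials += 1
    return best_model, best_inliers
```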

3. Structure and Motion Recovery

In the previous section it was seen how different views could be related to each other. In this section the relations between the views and the correspondences between the features will be used to retrieve the structure of the scene and the motion of the camera. This problem is called Structure and Motion.

Table 1. Overview of the two-view geometry computation algorithm.

Step 1. Compute a set of potential matches

Step 2. While Γ(#inliers, #samples) < 95% do

Step 2.1 select minimal sample (7 matches)

Step 2.2 compute solutions for F

Step 2.3 determine inliers

Step 3. Refine F based on all inliers

Step 4. Look for additional matches

Step 5. Refine F based on all correct matches

The approach that is proposed here extends Beardsley et al. (1997) and Koch et al. (1999b) by being fully projective and therefore not dependent on the quasi-Euclidean initialization. This was achieved by carrying out all measurements in the images. This approach provides an alternative to the triplet-based approach proposed in Fitzgibbon and Zisserman (1998). An image-based measure that is able to obtain a qualitative distance between viewpoints is also proposed to support initialization and determination of close views (independently of the actual projective frame).

At first two images are selected and an initial reconstruction frame is set up. Then the pose of the camera for the other views is determined in this frame and each time the initial reconstruction is refined and extended. In this way the pose estimation of views that have no common features with the reference views also becomes possible. Typically, a view is only matched with its predecessor in the sequence. In most cases this works fine, but in some cases (e.g. when the camera moves back and forth) it can be interesting to also relate a new view to a number of additional views. Once the structure and motion has been determined for the whole sequence, the results can be refined through a projective bundle adjustment. Then the ambiguity will be restricted to metric through self-calibration. Finally, a metric bundle adjustment is carried out to obtain an optimal estimation of the structure and motion.

3.1. Initial Structure and Motion

The first step consists of selecting two views that are suited for initializing the sequential structure and motion computation. On the one hand it is important that sufficient features are matched between these views; on the other hand the views should not be too close to each other so that the initial structure is well-conditioned. The first of these criteria is easy to verify, the second one is harder in the uncalibrated case. The image-based distance that we propose is the median distance between points transferred through an average planar homography and the corresponding points in the target image:

$$\operatorname{median}\{ d(H m_j, m'_j) \} \qquad (14)$$

This planar homography H is determined as follows from the matches between the two views:

$$H = [e]_{\times} F + e\, a_{\min}^{\top} \quad \text{with} \quad a_{\min} = \arg\min_a \sum_j d\!\left(([e]_{\times} F + e a^{\top}) m_j,\; m'_j\right)^2 \qquad (15)$$

In practice the selection of the initial frame can be done by maximizing the product of the number of matches and the image-based distance defined above. When features are matched between sparse views, the evaluation can be restricted to consecutive frames. However, when features are tracked over a video sequence, it is important to consider views that are further apart in the sequence.
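A sketch of this selection criterion: fit the "average" planar homography H = [e]×F + ea⊤ of Eq. (15) and return the median transfer distance of Eq. (14), scaled by the number of matches. The paper only states the minimization over a; solving for a with an algebraic (cross-product) linear least-squares fit, as done below, is our own simplification.

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def view_pair_score(F, m, m_prime):
    """m, m_prime: (n, 3) homogeneous matched points in the two candidate views."""
    e = np.linalg.svd(F)[0][:, -1]        # epipole e of Eq. (15), taken here as the
                                          # left null vector of F (second-image epipole)
    A, b = [], []
    for mj, mpj in zip(m, m_prime):       # algebraic constraint [m']x (H mj) = 0, linear in a
        S = skew(mpj)
        A.append(np.outer(S @ e, mj))
        b.append(-S @ skew(e) @ F @ mj)
    a = np.linalg.lstsq(np.vstack(A), np.hstack(b), rcond=None)[0]
    H = skew(e) @ F + np.outer(e, a)
    Hm = (H @ m.T).T
    d = np.linalg.norm(Hm[:, :2] / Hm[:, 2:] - m_prime[:, :2] / m_prime[:, 2:], axis=1)
    return len(m) * np.median(d)          # number of matches x image-based distance
```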

In the case of a video sequence where consecutive frames are very close together the computation of the epipolar geometry is ill conditioned. To avoid this problem we propose to only consider properly selected key-frames for the structure and motion recovery. If it is important to compute the motion for all frames, such as for insertion of virtual objects in a video sequence (see Section 5.3), the pose for in-between frames can be computed afterward. We propose to use model selection (Torr, 1998) to select the next key-frame only once the epipolar geometry model explains the tracked features better than the simpler homography model.¹

Initial Frame. Two images of the sequence are used to determine a reference frame. The world frame is aligned with the first camera. The second camera is chosen so that the epipolar geometry corresponds to the retrieved F12:

$$P_1 = [\, I_{3\times3} \mid 0_3\,], \qquad P_2 = [\, [e_{12}]_{\times} F_{12} + e_{12} a^{\top} \mid \sigma e_{12}\,] \qquad (16)$$

Equation (16) is not completely determined by the epipolar geometry (i.e. F12 and e12), but has 4 more degrees of freedom (i.e. a and σ). a determines the position of the reference plane (i.e. the plane at infinity in an affine or metric frame) and σ determines the global scale of the reconstruction. The parameter σ can simply be set to one or alternatively the baseline between the two initial views can be scaled to one. In Beardsley et al. (1997) it was proposed to set the coefficients of a so as to ensure a quasi-Euclidean frame, to avoid too large projective distortions. This was needed because not all parts of the algorithms were strictly projective. For the structure and motion approach proposed in this paper a can be arbitrarily set, e.g. a = [0 0 0]⊤.

Initializing Structure. Once two projection matrices have been fully determined the matches can be reconstructed through triangulation. Due to noise the lines of sight will not intersect perfectly. In the uncalibrated case the minimizations should be carried out in the images and not in projective 3D space. Therefore, the distance between the reprojected 3D point and the image points should be minimized:

$$d(m_1, P_1 M)^2 + d(m_2, P_2 M)^2 \qquad (17)$$

It was noted by Hartley and Sturm (1997) that the only important choice is to select in which epipolar plane the point is reconstructed. Once this choice is made it is trivial to select the optimal point from the plane. A bundle of epipolar planes has only one parameter, so the dimension of the problem is reduced from 3 dimensions to 1 dimension. Minimizing the following equation is thus equivalent to minimizing Eq. (17):

$$d(m_1, l_1(\alpha))^2 + d(m_2, l_2(\alpha))^2 \qquad (18)$$

with l1(α) and l2(α) the epipolar lines obtained as a function of the parameter α describing the bundle of epipolar planes. It turns out (see Hartley and Sturm, 1997) that this equation is a polynomial of degree 6 in α. The global minimum of Eq. (18) can thus easily be computed. In both images the point on the epipolar line l1(α), resp. l2(α), closest to the point m1, resp. m2, is selected. Since these points are in epipolar correspondence their lines of sight meet in a 3D point.

3.2. Updating the Structure and Motion

The previous section dealt with obtaining an initial reconstruction from two views. This section discusses how to add a view to an existing reconstruction. First the pose of the camera is determined, then the structure is updated based on the added view and finally new points are initialized.

Figure 1. Image matches $(m_{i-1}, m_i)$ are found as described before. Since the image points $m_{i-1}$ relate to object points $M_i$, the pose for view $k$ can be computed from the inferred matches $(M, m_i)$. A point is accepted as an inlier if a solution for $M$ exists for which $d(PM, m_i) < 1$ for each view $k$ in which $M$ has been observed.

Projective Pose Estimation. For every additional view the pose toward the pre-existing reconstruction is determined, then the reconstruction is updated. This is illustrated in Fig. 1. The first step consists of finding the epipolar geometry as described in Section 2.2. Then the matches which correspond to already reconstructed points are used to infer correspondences between 2D and 3D. Based on these the projection matrix Pk is computed using a robust procedure similar to the one laid out in Table 1. In this case a minimal sample of 6 matches is needed to compute Pk. A point is considered an inlier if there exists a 3D point that projects sufficiently close to all associated image points. This requires refining the initial solution of M based on all observations, including the last. Because this is computationally expensive (remember that it has to be done for each generated hypothesis), it is advised to use a modified version of RANSAC that cancels the verification of unpromising hypotheses (Chum and Matas, 2002). Once Pk has been determined the projection of already reconstructed points can be predicted, so that some additional matches can be obtained. This means that the search space is gradually reduced from the full image to the epipolar line to the predicted projection of the point.
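The resection itself can be sketched as a direct linear transform (DLT): each inferred 2D-3D correspondence contributes two linear equations in the twelve entries of Pk, so six correspondences suffice for a minimal sample. The RANSAC wrapper and the inlier test of Fig. 1 (refining M over all observations) are left out of this sketch.

```python
import numpy as np

def dlt_pose(M, m):
    """Linear (DLT) estimate of a 3x4 projection matrix from n >= 6 correspondences.
    M: (n, 4) homogeneous 3D points, m: (n, 3) homogeneous image points."""
    rows = []
    for Mi, (x, y, w) in zip(M, m):
        rows.append(np.hstack([w * Mi, np.zeros(4), -x * Mi]))
        rows.append(np.hstack([np.zeros(4), w * Mi, -y * Mi]))
    P = np.linalg.svd(np.vstack(rows))[2][-1].reshape(3, 4)
    return P / np.linalg.norm(P)

def reprojection_error(P, M, m):
    """Image-space distances d(P M, m), as used for the inlier test."""
    p = (P @ M.T).T
    return np.linalg.norm(p[:, :2] / p[:, 2:] - m[:, :2] / m[:, 2:], axis=1)
```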

This procedure only relates the image to the previous image. In fact it is implicitly assumed that once a point gets out of sight, it will not come back. Although this is true for many sequences, this assumption does not always hold. Assume that a specific 3D point got out of sight, but that it is visible again in the last two views. In this case a new 3D point will be instantiated. This will not immediately cause problems, but since these two 3D points are unrelated for the system, nothing enforces their positions to correspond. For longer sequences where the camera is moved back and forth over the scene, this can lead to poor results due to accumulated errors.

The solution that we propose is to match all the views that are close to the actual view (as described in Section 2.2). For every close view a set of potential 2D-3D correspondences is obtained. These sets are merged and the camera projection matrix is estimated using the same robust procedure as described above, but on the merged set of 2D-3D correspondences (Koch et al., 1999; Sawhney et al., 1998).

Close views are determined as follows. First a planar homography that best explains the image motion of feature points between the actual and the previous view is determined (using Eq. (15)). Then, the median residual for the transfer of these features to other views using homographies corresponding to the same plane is computed (see Eq. (14)). Since the direction of the camera motion is given through the epipoles, it is possible to limit the selection to the closest views in each direction. In this case it is important to take orientation into account (Hartley, 1998; Laveau and Faugeras, 1996) to differentiate between opposite directions.

Refining and Extending Structure. The structure is refined using an iterated linear reconstruction algorithm on each point. Equation (1) can be rewritten to become linear in M:

$$P_3 M x - P_1 M = 0, \qquad P_3 M y - P_2 M = 0 \qquad (19)$$

with Pi the i-th row of P and (x, y) the image coordinates of the point. An estimate of M is computed by solving the system of linear equations obtained from all views where a corresponding image point is available. To obtain a better solution the criterion $\sum d(PM, m)^2$ should be minimized. This can be approximately obtained by iteratively solving the following weighted linear equations (in matrix form):

$$\frac{1}{P_3 \tilde M} \begin{bmatrix} P_3 x - P_1 \\ P_3 y - P_2 \end{bmatrix} M = 0 \qquad (20)$$

where $\tilde M$ is the previous solution for M. This procedure can be repeated a few times. By solving this system of equations through SVD a normalized homogeneous point is automatically obtained. If a 3D point is not observed the position is not updated. In this case one can check if the point was seen in a sufficient number of views to be kept in the final reconstruction. This minimum number of views can for example be set to three. This avoids having an important number of outliers due to spurious matches.
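A sketch of the iterated linear refinement of a single structure point (Eqs. (19)–(20)); Ps and ms are the projection matrices and image points of the views in which the point was observed, and the number of reweighting passes is a free choice.

```python
import numpy as np

def triangulate_iterative(Ps, ms, n_iter=3):
    """Iteratively reweighted linear triangulation.
    Ps: list of 3x4 projection matrices, ms: list of (x, y) image coordinates."""
    weights = np.ones(len(Ps))                   # first pass: plain linear solution
    for _ in range(n_iter):
        rows = []
        for P, (x, y), w in zip(Ps, ms, weights):
            rows.append((P[2] * x - P[0]) / w)   # Eq. (19), scaled by 1 / (P3 M~)
            rows.append((P[2] * y - P[1]) / w)
        M = np.linalg.svd(np.vstack(rows))[2][-1]       # unit-norm homogeneous solution
        weights = np.array([P[2] @ M for P in Ps])      # P3 M~ for the next pass (Eq. (20))
    return M / M[3]
```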

Of course in an image sequence some new features will appear in every new image. If point matches are available that were not related to an existing point in the structure, then a new point can be initialized as in Section 3.1.

After this procedure has been repeated for all the images, one disposes of camera poses for all the views and the reconstruction of the interest points. In the further modules mainly the camera calibration is used. The reconstruction itself is used to obtain an estimate of the disparity range for the dense stereo matching.

3.3. Refining Structure and Motion

Once the structure and motion has been obtained for the whole sequence, it is recommended to refine it through a global minimization step. A maximum likelihood estimation can be obtained through bundle adjustment (Slama, 1980; Triggs et al., 2000). The goal is to find the parameters of the camera views Pi and the 3D points Mj for which the mean squared distance between the observed image points mij and the reprojected image points Pi(Mj) is minimized. The camera projection model should also take radial distortion into account. For m views and n points the following criterion should be minimized:

$$\min_{P_i, M_j} \sum_{i=1}^{m} \sum_{j=1}^{n} d\!\left(m_{ij}, P_i(M_j)\right)^2 \qquad (21)$$

If the errors on the localization of image features are independent and satisfy a zero-mean Gaussian distribution then it can be shown that bundle adjustment corresponds to a maximum likelihood estimator. This minimization problem is huge, e.g. for a sequence of 20 views and 100 points/view, a minimization problem in more than 6000 variables has to be solved (most of them related to the structure). A straightforward computation is obviously not feasible. However, the special structure of the problem can be exploited to solve the problem much more efficiently (Slama, 1980; Triggs et al., 2000). The key reason for this is that a specific residual is only dependent on one point and one camera, which results in a very sparse structure for the normal equations.
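A sketch of how this sparsity can be exploited in practice, here with SciPy's least_squares and a block-sparse Jacobian pattern. It is a generic illustration only: cameras are parametrized directly by their twelve projection-matrix entries, and the radial-distortion terms and constraints used in the paper are not reproduced.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.sparse import lil_matrix

def residuals(params, n_cams, n_pts, cam_idx, pt_idx, observed):
    """Reprojection residuals of Eq. (21) for all observations."""
    Ps = params[:12 * n_cams].reshape(n_cams, 3, 4)
    Ms = params[12 * n_cams:].reshape(n_pts, 4)
    proj = np.einsum('nij,nj->ni', Ps[cam_idx], Ms[pt_idx])
    return (proj[:, :2] / proj[:, 2:] - observed).ravel()

def jacobian_sparsity(n_cams, n_pts, cam_idx, pt_idx):
    """Each residual depends on exactly one camera and one point: block-sparse pattern."""
    A = lil_matrix((2 * len(cam_idx), 12 * n_cams + 4 * n_pts), dtype=int)
    for k, (c, p) in enumerate(zip(cam_idx, pt_idx)):
        A[2 * k:2 * k + 2, 12 * c:12 * (c + 1)] = 1
        A[2 * k:2 * k + 2, 12 * n_cams + 4 * p:12 * n_cams + 4 * (p + 1)] = 1
    return A

# result = least_squares(residuals, x0, method='trf', x_scale='jac',
#                        jac_sparsity=jacobian_sparsity(n_cams, n_pts, cam_idx, pt_idx),
#                        args=(n_cams, n_pts, cam_idx, pt_idx, observed))
```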

To conclude this section an overview of the algorithm to retrieve structure and motion from a sequence of images is given. Two views are selected and a projective frame is initialized. The matched corners are reconstructed to obtain an initial structure. The other views in the sequence are related to the existing structure by matching them with their predecessor. Once this is done the structure is updated. Existing points are refined and new points are initialized. When the camera motion implies that points continuously disappear and reappear it is interesting to relate an image to other close views. Once the structure and motion has been retrieved for the whole sequence, the results can be refined through bundle adjustment. The whole procedure is summarized in Table 2. The approach described here assumes that the tracked features are not all coplanar. If this were the case, then some specific measures are required (see Pollefeys et al., 2002).

Table 2. Overview of the projective structure and motion algorithm.

Step 1. Match or track points over the whole image sequence.

Step 2. Initialize the structure and motion recovery

Step 2.1. Select two views that are suited for initialization.

Step 2.2. Relate these views by computing the two view geometry.

Step 2.3. Set up the initial frame.

Step 2.4. Reconstruct the initial structure.

Step 3. For every additional view

Step 3.1. Infer matches to the structure and compute the camera pose using a robust algorithm.

Step 3.2. Refine the existing structure.

Step 3.3. (optional) For already computed views which are “close”

3.3.1. Relate this view with the current view by finding feature matches and computing the two view geometry.

3.3.2. Infer new matches to the structure based on the computed matches and add these to the list used in Step 3.1.

Step 3.4. Refine the pose from all the matches using a robust algorithm.

Step 3.5. Initialize new structure points.

Step 4. Refine the structure and motion through bundle adjustment.

3.4. Upgrading to Metric

The reconstruction obtained as described in the previous sections is only determined up to an arbitrary projective transformation. This might be sufficient for some robotics or inspection applications, but certainly not for visualization. Therefore we need a method to upgrade the reconstruction to a metric one (i.e. determined up to an arbitrary Euclidean transformation and a scale factor). This can be done by imposing some constraints on the intrinsic camera parameters. This approach, which is called self-calibration, has received a lot of attention in recent years. Mostly self-calibration algorithms are concerned with unknown but constant intrinsic camera parameters (Faugeras et al., 1992; Hartley, 1994; Pollefeys and Van Gool, 1999; Heyden and Astrom, 1996; Triggs, 1997). Some algorithms for varying intrinsic camera parameters have also been proposed (Pollefeys et al., 1999; Heyden and Astrom, 1997). In some cases the motion of the camera is not general enough to allow self-calibration to uniquely recover the metric structure and an ambiguity remains. More details can be found in Sturm (1997a) for constant intrinsics and in Sturm (1999), Pollefeys (1999) and Kahl (1999) for varying intrinsics.

The approach that is presented here was originally proposed in Pollefeys et al. (1998) and later adapted to take a priori information on the intrinsic camera parameters into account, which reduces the problem of critical motion sequences.

The Image of the Absolute Conic. One of the most important concepts for self-calibration is the absolute conic and its projection in the images. The simplest way to represent the absolute conic is through the dual absolute quadric Ω* (Triggs, 1997). In a Euclidean coordinate frame Ω* = diag(1, 1, 1, 0) and one can easily verify that it is invariant to similarity transformations. Inversely, it can also be shown that a transformation that leaves the dual quadric Ω* = diag(1, 1, 1, 0) unchanged is a similarity transformation. For a projective reconstruction Ω* can be represented by a 4 × 4 rank-3 symmetric positive semi-definite matrix. According to the properties mentioned above a transformation that transforms Ω* → diag(1, 1, 1, 0) will bring the reconstruction within a similarity transformation of the original scene, i.e. yield a metric reconstruction.

The projection of the dual absolute quadric in the image is described by the following equation:

$$\omega^* \sim P\, \Omega^* P^{\top}. \qquad (22)$$

It can be easily verified that in a Euclidean coordinate frame the image of the absolute quadric is directly related to the intrinsic camera parameters:

$$\omega^* \sim K K^{\top} \qquad (23)$$

Since the images are independent of the projective basis of the reconstruction, Eq. (23) is always valid and constraints on the intrinsics can be translated to constraints on Ω*.

Linear Self-Calibration. The approach proposed in this paper is inspired by Pollefeys (1999); however, some important improvements were made. A priori knowledge about the parameters is introduced in the linear computations (Pollefeys et al., 2002). This reduces the problems with critical motion sequences (Sturm, 1997b; Pollefeys, 1999).

The first step consists of normalizing the projection matrices. The following normalization is proposed:

$$P_N = K_N^{-1} P \quad \text{with} \quad K_N = \begin{bmatrix} w+h & 0 & \frac{w}{2} \\ 0 & w+h & \frac{h}{2} \\ 0 & 0 & 1 \end{bmatrix} \qquad (24)$$

where w and h are the width, resp. height of the image. After the normalization the focal length should be of the order of unity and the principal point should be close to the origin. The above normalization would scale a focal length of a 60 mm lens to 1 and thus focal lengths in the range of 20 mm to 180 mm would end up in the range [1/3, 3]. The aspect ratio is typically also around 1 and the skew can be assumed 0 for all practical purposes. Making this a priori knowledge more explicit and estimating reasonable standard deviations one could for example get f ≈ rf ≈ 1 ± 3, u ≈ v ≈ 0 ± 0.1, r ≈ 1 ± 0.1 and s = 0. It is now interesting to investigate the impact of this knowledge on ω*:

$$\omega^* \sim K K^{\top} = \begin{bmatrix} f^2 + s^2 + u^2 & s r f + u v & u \\ s r f + u v & r^2 f^2 + v^2 & v \\ u & v & 1 \end{bmatrix} \approx \begin{bmatrix} 1 \pm 9 & \pm 0.01 & \pm 0.1 \\ \pm 0.01 & 1 \pm 9 & \pm 0.1 \\ \pm 0.1 & \pm 0.1 & 1 \end{bmatrix} \qquad (25)$$

and $\omega^*_{22}/\omega^*_{11} \approx 1 \pm 0.2$. The constraints on the left-hand side of Eq. (22) should also be verified on the right-hand side (up to scale). The uncertainty can be taken into account by weighting the equations accordingly:

$$\begin{aligned}
\tfrac{1}{9\nu}\bigl(P_1 \Omega^* P_1^{\top} - P_3 \Omega^* P_3^{\top}\bigr) &= 0\\
\tfrac{1}{9\nu}\bigl(P_2 \Omega^* P_2^{\top} - P_3 \Omega^* P_3^{\top}\bigr) &= 0\\
\tfrac{1}{0.2\,\nu}\bigl(P_1 \Omega^* P_1^{\top} - P_2 \Omega^* P_2^{\top}\bigr) &= 0\\
\tfrac{1}{0.1\,\nu}\bigl(P_1 \Omega^* P_2^{\top}\bigr) &= 0\\
\tfrac{1}{0.1\,\nu}\bigl(P_1 \Omega^* P_3^{\top}\bigr) &= 0\\
\tfrac{1}{0.01\,\nu}\bigl(P_2 \Omega^* P_3^{\top}\bigr) &= 0
\end{aligned} \qquad (26)$$

with $P_i$ the $i$-th row of $P$ and $\nu$ a scale factor that is initially set to 1 and in later iterations to $P_3 \Omega^* P_3^{\top}$, with $\Omega^*$ the result of the previous iteration. Since $\Omega^*$ is a symmetric 4 × 4 matrix it is parametrized through 10 coefficients. An estimate of the dual absolute quadric $\Omega^*$ can be obtained by solving the above set of equations for all views through linear least-squares. The rank-3 constraint should be imposed by forcing the smallest singular value to zero. This scheme can be iterated until the $\nu$ factors converge (typically after a few iterations). In most cases, however, the first iteration is sufficient to initialize the metric bundle adjustment. The upgrading transformation $T$ can be obtained from $\operatorname{diag}(1, 1, 1, 0) = T \Omega^* T^{\top}$ by decomposition of $\Omega^*$.

The metric structure and motion is then obtained as

$$P_M = P\,T^{-1} \quad \text{and} \quad M_M = T\,M \qquad (27)$$
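A sketch of the linear self-calibration and metric upgrade (Eqs. (24)–(27)): the projection matrices are normalized, the weighted equations (26) are stacked as a linear system in the ten parameters of the symmetric Ω*, the null vector is taken, rank 3 is enforced, and Ω* is decomposed into the upgrading transformation T. The single (non-iterated) pass with ν = 1 and the eigenvalue-based decomposition are simplifications of this sketch.

```python
import numpy as np

PAIRS = [(i, j) for i in range(4) for j in range(i, 4)]   # the 10 entries of symmetric Omega*

def quad_coeffs(p, q):
    """Coefficients of the scalar p Omega* q^T as a linear form in the 10 parameters."""
    return np.array([p[i] * q[j] + (p[j] * q[i] if i != j else 0.0) for i, j in PAIRS])

def self_calibrate(Ps, w, h):
    """One pass of the weighted linear self-calibration; returns the upgrading T."""
    KN = np.array([[w + h, 0, w / 2], [0, w + h, h / 2], [0, 0, 1.0]])   # Eq. (24)
    A = []
    for P in Ps:
        p1, p2, p3 = np.linalg.inv(KN) @ P                               # normalized rows
        A += [(1 / 9) * (quad_coeffs(p1, p1) - quad_coeffs(p3, p3)),
              (1 / 9) * (quad_coeffs(p2, p2) - quad_coeffs(p3, p3)),
              (1 / 0.2) * (quad_coeffs(p1, p1) - quad_coeffs(p2, p2)),
              (1 / 0.1) * quad_coeffs(p1, p2),
              (1 / 0.1) * quad_coeffs(p1, p3),
              (1 / 0.01) * quad_coeffs(p2, p3)]
    x = np.linalg.svd(np.vstack(A))[2][-1]          # least-squares null vector of the system
    Omega = np.zeros((4, 4))
    for val, (i, j) in zip(x, PAIRS):
        Omega[i, j] = Omega[j, i] = val
    vals, vecs = np.linalg.eigh(Omega)
    if vals.sum() < 0:                              # fix the overall sign of the null vector
        vals, Omega = -vals, -Omega
    k = int(np.argmin(np.abs(vals)))                # eigenvalue forced to zero (rank 3)
    keep = [i for i in range(4) if i != k]
    H = np.zeros((4, 4))
    H[:, :3] = vecs[:, keep] * np.sqrt(np.maximum(vals[keep], 1e-12))
    H[:, 3] = vecs[:, k]
    return np.linalg.inv(H)                         # diag(1,1,1,0) = T Omega* T^T

# Metric upgrade (Eq. (27)):  P_metric = P @ np.linalg.inv(T),  M_metric = T @ M.
```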

Refinement. This initial metric reconstruction should then be further refined through bundle adjustment to obtain the best possible results. While some approaches suggest an intermediate non-linear refinement of the self-calibration, our experience shows that this is in general not necessary if one uses the self-calibration approach presented in the previous paragraph (as well as an initial correction of the radial distortion). For this bundle adjustment procedure the camera projection model should explicitly represent the constraints on the camera intrinsics. These constraints can be either hard constraints (imposed through parametrization) or soft constraints (imposed by including an additional term in the minimization criterion). A typical choice of constraints for a photo camera consists of imposing a constant focal length (if no zoom was used), a constant principal point and radial distortion, an aspect ratio of one and the absence of skew. However, for a camcorder/video camera it is important to estimate the (constant) aspect ratio as this can significantly differ from one.

4. Dense Surface Estimation

With the camera calibration given for all viewpoints of the sequence, we can proceed with methods developed for calibrated structure from motion algorithms. The feature tracking algorithm already delivers a sparse surface model based on distinct feature points. This however is not sufficient to reconstruct geometrically correct and visually pleasing surface models. This task is accomplished by a dense disparity matching that estimates correspondences from the images by exploiting additional geometrical constraints. The dense surface estimation is done in a number of steps. First image pairs are rectified to the standard stereo configuration. Then disparity maps are computed through a stereo matching algorithm. Finally a multi-view approach integrates the results obtained from several view pairs.

4.1. Rectification

Since the calibration between successive image pairs was computed, the epipolar constraint that restricts the correspondence search to a 1-D search range can be exploited. Image pairs are warped so that epipolar lines coincide with the image scan lines. The correspondence search is then reduced to a matching of the image points along each image scan-line. This results in a dramatic increase of the computational efficiency of the algorithms by enabling several optimizations in the computations.

For some motions (i.e. when the epipole is located in the image) standard rectification based on planar homographies is not possible and a more advanced procedure should be used. The approach used in the presented system was proposed in Pollefeys et al. (1999). The method combines simplicity with minimal image size and works for all possible motions. The key idea is to use polar coordinates with the epipole as origin. Corresponding lines are given through the epipolar geometry. By taking the orientation (Laveau and Faugeras, 1996) into account the matching ambiguity is reduced to half epipolar lines. A minimal image size is achieved by computing the angle between two consecutive epipolar lines (corresponding to rows in the rectified images) such that the worst-case pixel on the line preserves its area. To avoid image degradation, both the correction of radial distortion and the rectification are performed in a single resampling step.

Figure 2. Original image pair (left) and rectified image pair (right).

Some Examples. A first example comes from the castle sequence. In Fig. 2 an image pair and the associated rectified image pair are shown. A second example was filmed with a hand-held digital video camera in the Beguinage in Leuven. Due to the narrow streets only forward motion is feasible. In this case the full advantage of the polar rectification scheme becomes clear since this sequence could not have been handled through traditional planar rectification. An example of a rectified image pair is given in Fig. 3. Note that the whole left part of the rectified images corresponds to the epipole. On the right side of this figure a model that was obtained by combining the results from several image pairs is shown.

4.2. Stereo Matching

The goal of a dense stereo algorithm is to compute a corresponding pixel for every pixel of an image pair. After rectification the correspondence search is limited to corresponding scanlines. As illustrated in Fig. 4, finding the correspondences for a pair of scanlines can be seen as a path search problem. Besides the epipolar geometry, other constraints, like preserving the order of neighboring pixels, bidirectional uniqueness of the match, and detection of occlusions, can be exploited.

Figure 3. Rectified image pair (left) and some views of the reconstructed street model (right).

Figure 4. Illustration of the ordering constraint (left), dense matching as a path search problem (right).

In most cases it is also possible to limit the search to a certain disparity range (an estimate of this range can be obtained from the reconstructed 3D feature points). Besides these constraints, a stereo algorithm should also take into account the similarity between corresponding points and the continuity of the surface. It is possible to compute the optimal path taking all the constraints into account using dynamic programming (Cox et al., 1996; Falkenhagen, 1997; Van Meerbergen et al., 2002). While many other stereo approaches are available, we use this one because it provides a good trade-off between quality and speed. Computation of a disparity map between two video frames with a disparity range of one hundred, including polar rectification, takes less than a minute on a PC. By treating pixels independently stereo can be performed much faster (even in real-time), but at the expense of quality. Some recent stereo approaches perform a global optimization that also takes continuity across scanlines into account and therefore achieve better results, but are much slower. A good taxonomy of stereo algorithms can be found in Scharstein and Szeliski (2002). Notice that because of the modular design of our 3D reconstruction pipeline, it is simple to substitute one stereo algorithm for another.
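A compact sketch of the scanline matching as a dynamic-programming path search, with a uniform occlusion cost and the absolute intensity difference as matching cost (the actual system uses the more elaborate cost and constraints of Van Meerbergen et al. (2002)):

```python
import numpy as np

def scanline_dp(left, right, max_disp, occlusion_cost=0.1):
    """Match two rectified scanlines by dynamic programming.
    Returns a disparity for every left pixel, or -1 where it is occluded."""
    n = len(left)
    cost = np.full((n + 1, n + 1), np.inf)
    move = np.zeros((n + 1, n + 1), dtype=int)      # 0 = match, 1 = skip left, 2 = skip right
    cost[0, :] = occlusion_cost * np.arange(n + 1)
    cost[:, 0] = occlusion_cost * np.arange(n + 1)
    for i in range(1, n + 1):
        for j in range(max(1, i - max_disp), i + 1):   # ordering + disparity range
            candidates = ((cost[i - 1, j - 1] + abs(left[i - 1] - right[j - 1]), 0),
                          (cost[i - 1, j] + occlusion_cost, 1),
                          (cost[i, j - 1] + occlusion_cost, 2))
            cost[i, j], move[i, j] = min(candidates)
    disp = np.full(n, -1)
    i = j = n
    while i > 0 and j > 0:                          # backtrack the optimal path
        if move[i, j] == 0:
            disp[i - 1] = i - j
            i, j = i - 1, j - 1
        elif move[i, j] == 1:
            i -= 1
        else:
            j -= 1
    return disp

# disparities = scanline_dp(left_row.astype(float), right_row.astype(float), max_disp=16)
```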

4.3. Multi-View Linking

The pairwise disparity estimation allows the computation of image-to-image correspondences between adjacent rectified image pairs, and independent depth estimates for each camera viewpoint. An optimal joint estimate is achieved by fusing all independent estimates into a common 3D model. The fusion can be performed in an economical way through controlled correspondence linking (see Fig. 5). A point is transferred from one image to the next image as follows:

Figure 5. Depth fusion and uncertainty reduction from correspondence linking (left); linking stops when an outlier is encountered (right).

$$m' = R'^{-1}\!\left(R(m) + D(R(m))\right) \qquad (28)$$

with R(·) and R′(·) functions that map points from the original image into the rectified image and D(·) a function that corresponds to the disparity map. When the depth obtained from the new image point m′ is outside the confidence interval the linking is stopped; otherwise the result is fused with the previous values through a Kalman filter. This approach is discussed in more detail in Koch et al. (1998). It combines the advantages of small-baseline and wide-baseline stereo. It can provide a very dense depth map by avoiding most occlusions. The depth resolution is increased through the combination of multiple viewpoints and a large global baseline, while the matching is simplified through the small local baselines. Due to multiple observations of a single surface point the texture can be enhanced and noise and highlights can be removed.
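The per-pixel fusion behind the correspondence linking can be sketched as a scalar Kalman update: each new linked view contributes a depth measurement with some variance, and linking stops when the new measurement falls outside the confidence interval of the running estimate. Both the measurement model and the outlier gate below are simplified assumptions, not the exact filter of Koch et al. (1998).

```python
import numpy as np

def link_depth(measurements, variances, gate=3.0):
    """Fuse a chain of depth measurements for one pixel with a scalar Kalman filter.
    Linking stops at the first measurement outside the `gate`-sigma confidence interval."""
    depth, var = measurements[0], variances[0]
    n_linked = 1
    for z, r in zip(measurements[1:], variances[1:]):
        if abs(z - depth) > gate * np.sqrt(var + r):   # outlier: stop the link
            break
        k = var / (var + r)                            # Kalman gain (static state)
        depth = depth + k * (z - depth)
        var = (1.0 - k) * var
        n_linked += 1
    return depth, var, n_linked

# Example: consistent measurements refine the estimate, the outlier ends the chain.
print(link_depth([2.00, 2.02, 1.99, 3.50, 2.01], [0.01] * 5))
```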

Starting from the computed structure and motion, alternative approaches such as sum-of-sum-of-square-differences (SSSD) (Okutomi and Kanade, 1993) or space-carving (Kutulakos and Seitz, 2000) could also be used. The advantage of an approach that only uses pairwise matching followed by multi-view linking is that it is more robust to changes in camera exposure, non-Lambertian surfaces, passers-by, etc. This is important for obtaining good quality results using hand-held camera sequences recorded in an uncontrolled environment.

Some Results. The quantitative performance of correspondence linking can be tested in different ways. One measure already mentioned is the visibility of an object point. In connection with correspondence linking, we have defined visibility V as the number of views linked to the reference view. Another important feature of the algorithm is the density and accuracy of the depth maps. To describe its improvement over the 2-view estimator, we define the fill rate F and the average relative depth error E as additional measures. The 2-view disparity estimator is a special case of the proposed linking algorithm, hence both can be compared on an equal basis. Figure 6 displays visibility and relative depth error for sequences from 2 to 15 images of the castle sequence, chosen symmetrically around the reference image. The average visibility V shows that for up to 5 images nearly all images are utilized. For 15 images, on average 9 images are linked. The amount of linking is reflected in the relative depth error that drops from 5% in the 2-view estimator to about 1.2% for 15 images.

Linking two views is the minimum case that allows triangulation. To increase the reliability of the estimates, a surface point should be observed in more than two images. We can therefore impose a minimum visibility Vmin on a depth estimate. This will reject unreliable depth estimates effectively, but will also reduce the fill rate of the depth map. The graphs in Fig. 6 (center) show the dependency of the fill rate and depth error on minimum visibility for the sequence length N = 11. The fill rate drops from 92% to about 70%, but at the same time the depth error is reduced to 0.5% due to outlier rejection. The depth map and the relative error distribution over the depth map are displayed in Fig. 6 (right). The error distribution shows a periodic structure that in fact reflects the quantization uncertainty of the disparity resolution when it switches from one disparity value to the next.

Figure 6. Statistics of the castle sequence. Influence of sequence length N on visibility V and relative depth error E (left). Influence of minimum visibility Vmin on fill rate F and depth error E for N = 11 (center). Depth map (above: dark = near, light = far) and error map (below: dark = large error, light = small error) for N = 11 and Vmin = 3 (right).

5. Visual Scene Representations

In the previous sections a dense structure and motion recovery approach was given. This yields all the necessary information to build different types of visual models. In this section several types of models will be considered. First, the construction of texture-mapped 3D surface models is discussed. Then, a combined image- and geometry-based approach is presented that can render models ranging from pure plenoptic to view-dependent texture and geometry models. Finally, the possibility of fusion of real and virtual scenes in video is also treated. These different cases will now be discussed in more detail.

5.1. 3D Surface Reconstruction

The 3D surface is approximated by a triangular mesh to reduce geometric complexity and to tailor the model to the requirements of computer graphics visualization systems. A simple approach consists of overlaying a 2D triangular mesh on top of one of the images and then building a corresponding 3D mesh by placing the vertices of the triangles in 3D space according to the values found in the corresponding depth map. To reduce noise it is recommended to first smooth the depth image (the kernel can be chosen of the same size as the mesh triangles). The image itself can be used as texture map (the texture coordinates are trivially obtained as the 2D coordinates of the vertices).

It can happen that for some vertices no depth value is available or that the confidence is too low. In these cases the corresponding triangles are not reconstructed. The same happens when triangles are placed over discontinuities. This is achieved by selecting a maximum angle between the normal of a triangle and the line-of-sight through its center (e.g. 85 degrees). This simple approach works very well on the dense depth maps obtained through multi-view linking. The surface reconstruction approach is illustrated in Fig. 7. The texture can be enhanced through the multi-view linking scheme. A median or robust mean of the corresponding texture values is computed to discard imaging artifacts like sensor noise, specular reflections and highlights (Ofek et al., 1997). Ideally, to avoid artifacts on stretched triangles (such as the ground in Fig. 7), projective texture mapping has to be used instead of the standard affine mapping (Debevec et al., 1998).

Figure 7. Surface reconstruction approach (top): a triangular mesh is overlaid on top of the image. The vertices are back-projected in space according to the depth values. From this a 3D surface model is obtained (bottom).
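A minimal sketch of this step: a regular 2D grid is overlaid on the image, each vertex is back-projected using the (smoothed) depth map and a pinhole calibration K, and triangles with missing depth or viewed too obliquely (normal nearly perpendicular to the line of sight) are dropped. The validity convention (depth <= 0 meaning "no depth") and the grid spacing are assumptions of this sketch.

```python
import numpy as np

def depth_to_mesh(depth, K, max_angle_deg=85.0, step=4):
    """Build (vertices, triangles) from a depth map and intrinsics K;
    `step` is the spacing of the 2D grid overlaid on the image."""
    h, w = depth.shape
    Kinv = np.linalg.inv(K)
    ys, xs = np.mgrid[0:h:step, 0:w:step]
    rows, cols = ys.shape
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(float)
    verts = (Kinv @ pix.T).T * depth[ys, xs].reshape(-1, 1)   # back-projected vertices

    def keep(i0, i1, i2):
        v0, v1, v2 = verts[i0], verts[i1], verts[i2]
        if min(v0[2], v1[2], v2[2]) <= 0:                     # a vertex without valid depth
            return False
        n = np.cross(v1 - v0, v2 - v0)                        # triangle normal
        c = (v0 + v1 + v2) / 3.0                              # line of sight through the center
        denom = np.linalg.norm(n) * np.linalg.norm(c)
        return denom > 0 and abs(n @ c) / denom > np.cos(np.radians(max_angle_deg))

    tris = []
    for r in range(rows - 1):
        for cidx in range(cols - 1):
            i = r * cols + cidx
            for tri in ((i, i + 1, i + cols), (i + 1, i + cols + 1, i + cols)):
                if keep(*tri):
                    tris.append(tri)
    return verts, np.array(tris)
```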

To reconstruct more complex shapes it is necessaryto combine results from multiple depth maps. The sim-plest approach consists of generating separate modelsindependently and then loading them together in thegraphics system. Since all depth-maps can be locatedin a single metric frame, registration is not an issue.For more complex scenes different meshes can be in-tegrated into a single surface representation. Differentapproaches have been proposed to deal with this prob-lem. These can broadly be classified in surface-basedapproaches (Soucy and Laurendeau, 1995; Turk andLevoy, 1994) and volumetric approaches (Curless andLevoy, 1996; Wheeler et al., 1998). In our approach we



In our approach we have implemented the method of Curless and Levoy (1996) to obtain an implicit representation of the surface, followed by the marching cubes algorithm to obtain an explicit mesh representation (Lorensen and Cline, 1987), and finally we apply a variant of Schroeder et al. (1992) to simplify the mesh.
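For completeness, a heavily simplified sketch of such a volumetric pipeline is given below. It is not the implementation referred to above: the per-view projection functions, the grid resolution and the truncation distance are assumptions of the sketch, scikit-image's marching_cubes stands in for the cited marching cubes step, the mesh simplification stage is omitted, and the conversion of the returned voxel-space vertices back to world coordinates is left out.

    import numpy as np
    from skimage import measure  # marching cubes (cf. Lorensen and Cline, 1987)

    def fuse_depth_maps(depth_maps, project_fns, bbox_min, bbox_max, res=128, trunc=0.02):
        """Average a truncated signed distance over all views on a regular voxel grid.
        project_fns[i](pts) -> (u, v, z) is assumed to return pixel coordinates and camera
        depth for the Nx3 points pts in view i (an assumption of this sketch)."""
        grid = np.linspace(0.0, 1.0, res)
        axes = [bbox_min[k] + grid * (bbox_max[k] - bbox_min[k]) for k in range(3)]
        xs, ys, zs = np.meshgrid(*axes, indexing='ij')
        pts = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
        sdf = np.zeros(len(pts))
        weight = np.zeros(len(pts))
        for depth, project in zip(depth_maps, project_fns):
            u, v, z = project(pts)
            inside = (u >= 0) & (u < depth.shape[1]) & (v >= 0) & (v < depth.shape[0]) & (z > 0)
            observed = np.full(len(pts), np.nan)
            observed[inside] = depth[v[inside].astype(int), u[inside].astype(int)]
            signed = observed - z                     # positive in front of the surface
            ok = np.isfinite(signed) & (signed > -trunc)
            sdf[ok] += np.clip(signed[ok], -trunc, trunc)
            weight[ok] += 1.0
        sdf = np.where(weight > 0, sdf / np.maximum(weight, 1.0), trunc)
        # the zero crossing of the averaged signed distance is the fused surface
        verts, faces, _, _ = measure.marching_cubes(sdf.reshape(res, res, res), level=0.0)
        return verts, faces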

Example. A first example was recorded using a consumer camcorder (Sony TRV900). A 20 second shot was made of a Medusa head located on the entablature of a monumental fountain in the ancient city of Sagalassos (Turkey). The head itself is about 30 cm across. Using progressive scan, frames of 720 × 576 are obtained; an example is shown on the upper-left of Fig. 8. Key-frames are automatically selected and the structure of the tracked features and the motion and calibration of the camera are computed, see upper-right of Fig. 8. It is interesting to notice that for this camera the aspect ratio is actually not 1, but around 1.09, which can be observed by comparing the upper-left and the lower-left image in Fig. 8 (notice that it is the real picture that is unnaturally stretched vertically). The next stage consisted of computing a dense surface representation. To this effect stereo matching was performed for all pairs of consecutive key-frames. Using our multi-view linking approach a dense depth map was computed for a central frame and the corresponding image was applied as a texture. Several views of the resulting model are shown in Fig. 8. The shaded view makes it possible to appreciate the high quality of the recovered geometry. We have also performed a more quantitative evaluation of the results. The accuracy of the reconstruction should be considered at two levels. Errors on the camera motion and calibration computations can result in a global bias on the reconstruction. From the results of the bundle adjustment we can estimate this error to be of the order of 3 mm for points on the reconstruction. The depth computations indicate that 90% of the reconstructed points have a relative error of less than 1 mm. Note that the stereo correlation uses a 7 × 7 window which corresponds to a size of 5 × 5 mm on the object, and therefore the measured depth will typically correspond to the dominant visual feature within that patch.

Our second example was also recorded on the archaeological site of Sagalassos. We took a sequence of about 25 photographs of the excavations of a Roman villa at different moments in time so that we could reconstruct the different layers of the stratigraphy. In this case a reconstruction based on a single depth map is not feasible and we have used the volumetric approach explained above to extract a single 3D mesh representing the complete stratigraphic layer within the sector of interest. Two of these layers are shown in Fig. 9. The possibility to acquire 3D records using cheap consumer products, the absence of a calibration procedure, and the flexibility and limited time needed for acquisition make our approach particularly suitable for archaeologists.

5.2. Combined Image- and Geometry-Based Scene Visualization

In this section a different approach is taken to the visualization of 3D scenes. An image-based approach is proposed that can efficiently deal with hand-held camera images. An underlying view-dependent geometry is used to minimize artifacts during visualization. This approach avoids the need for a globally consistent 3D surface representation. This makes it possible to render realistic views of more complex scenes and to reproduce visual effects such as highlights and reflections during rendering. A more in-depth discussion of this approach can be found in the following papers: Koch et al. (1999a, 1999b) and Heigl et al. (1999).

For rendering new views two major concepts are known in the literature. The first one is the geometry-based concept. The scene geometry is reconstructed from a stream of images and a single texture is synthesized which is mapped onto this geometry. For this approach, a limited set of camera views is sufficient, but view-dependent effects such as specularities can not be handled appropriately. This approach was discussed in the previous section. The second major concept is image-based rendering. This approach models the scene as a collection of views all around the scene without an exact geometrical representation (Levoy and Hanrahan, 1996). New (virtual) views are rendered from the recorded ones by interpolation. Optionally, approximate geometrical information can be used to improve the results (Gortler et al., 1996). It was shown that this can greatly reduce the required number of images (Chai et al., 2000).

Plenoptic Modeling and Rendering. In McMillan and Bishop (1995) the appearance of a scene is described through all light rays (2D) that are emitted from every 3D scene point, generating a 5D radiance function. Subsequently two equivalent realizations of the plenoptic function were proposed in the form of the lightfield (Levoy and Hanrahan, 1996) and the lumigraph (Gortler et al., 1996).



Figure 8. Reconstruction of ancient Medusa head: video frame and recovered structure and motion for key-frames (top), textured and shaded view of 3D reconstruction (middle), frontal view and detailed view (bottom).

They handle the case when the observer and the scene can be separated by a surface. Hence the plenoptic function is reduced to four dimensions. The radiance is represented as a function of light rays passing through the separating surface. To create such a plenoptic model for real scenes, a large number of views is taken. These views can be considered as a collection of light rays with corresponding color values.

They are discrete samples of the plenoptic function. The light rays which are not represented have to be interpolated from recorded ones, taking into account additional information on physical restrictions. Often real objects are assumed to be Lambertian, meaning that one point of the object has the same radiance value in all possible directions. This implies that two viewing rays have the same color value if they intersect at a surface point.



Figure 9. Excavation of Roman villa: front and top view of two different stratigraphic layers.

If specular effects occur, this is no longer true. Two viewing rays then have similar color values if their direction is similar and if their point of intersection is near the real scene point which originates their color value. To render a new view we suppose to have a virtual camera looking at the scene. We determine those viewing rays which are nearest to those of this camera. The nearer a ray is to a given ray, the greater is its support to the color value.

The original 4D lightfield (Levoy and Hanrahan, 1996) data structure employs a two-plane parameterization. Each light ray passes through two parallel planes with plane coordinates (s, t) and (u, v). The (u, v)-plane is the viewpoint plane in which all camera focal points are placed on regular grid points. The (s, t)-plane is the focal plane. New views can be rendered by intersecting each viewing ray of a virtual camera with the two planes at (s, t, u, v). The resulting radiance is a look-up into the regular grid. For rays passing in between the (s, t) and (u, v) grid coordinates an interpolation is applied that will degrade the rendering quality depending on the scene geometry. In fact, the lightfield contains an implicit geometrical assumption, i.e. the scene geometry is planar and coincides with the focal plane. Deviation of the scene geometry from the focal plane causes image degradation (i.e. blurring or ghosting). To use hand-held camera images, the solution proposed in Gortler et al. (1996) consists of rebinning the images to the regular grid. The disadvantage of this rebinning step is that the interpolated regular structure already contains inconsistencies and ghosting artifacts because of errors in the scantily approximated geometry. During rendering the effect of ghosting artifacts is repeated, so duplicate ghosting effects occur.
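To make the two-plane parameterization concrete, the small fragment below converts a single viewing ray to (s, t, u, v) coordinates. The placement of the planes at z = 0 and z = 1 and the helper grid_coords that maps plane coordinates to grid and pixel indices are assumptions of this sketch, and a nearest-neighbour look-up replaces the interpolation discussed above.

    import numpy as np

    def ray_plane_params(origin, direction):
        """Intersect a ray with the viewpoint plane z = 0, giving (u, v),
        and with the focal plane z = 1, giving (s, t)."""
        t_uv = (0.0 - origin[2]) / direction[2]
        t_st = (1.0 - origin[2]) / direction[2]
        u, v = (origin + t_uv * np.asarray(direction))[:2]
        s, t = (origin + t_st * np.asarray(direction))[:2]
        return s, t, u, v

    def lookup(lightfield, grid_coords, s, t, u, v):
        """lightfield[i][j] is the image recorded at grid viewpoint (i, j);
        grid_coords (a hypothetical helper) maps (u, v, s, t) to integer indices."""
        i, j, x, y = grid_coords(u, v, s, t)
        return lightfield[i][j][y, x]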

Rendering from Recorded Images. Our goal is to overcome the problems described in the last section by relaxing the restrictions imposed by the regular lightfield structure and to render views directly from the calibrated sequence of recorded images with the use of local depth maps. Without losing performance, the original images are directly mapped onto one or more planes viewed by a virtual camera.

To obtain a high-quality image-based scene representation, we need many views of a scene from many directions. For this, we can record an extended image sequence moving the camera in a zigzag-like manner. The camera can cross its own moving path several times or at least get close to it. To obtain a good quality structure-and-motion estimation from this type of sequence it is important to use the extensions proposed in Section 3.2 to match close views that are not predecessors or successors in the image stream. To allow the construction of the local geometrical approximation, depth maps should also be computed as described in the previous section.

Fixed Plane Approximation. In a first approach, we approximate the scene geometry by a single plane L by minimizing the least-squares error. We map all given camera images onto plane L and view it through a virtual camera.



This can be achieved by directly mapping the coordinates x_i, y_i of image i onto the virtual camera coordinates [x_V y_V 1]^T = H_iV [x_i y_i 1]^T. Therefore we can perform a direct look-up into the originally recorded images and determine the radiance by interpolating the recorded neighboring pixel values. This technique is similar to the lightfield approach (Levoy and Hanrahan, 1996), which implicitly assumes the focal plane as the plane of geometry. Thus to construct a specific view we have to interpolate between neighboring views. Those views whose projection center is close to the viewing ray of a particular pixel give the most support to its color value. This is equivalent to saying that those views whose projected camera centers are close to its image coordinates give the most support to a specified pixel. We restrict the support to the nearest three cameras (see Fig. 10). We project all camera centers into the virtual image and perform a 2D Delaunay triangulation. Then the neighboring cameras of a pixel are determined by the corners of the triangle which this pixel belongs to. Each triangle is drawn as a sum of three triangles. For each camera we look up the color values in the original image as described above and multiply them with weight 1 at the corresponding vertex and with weight 0 at both other vertices.

Figure 10. Drawing triangles of neighboring projected camera centers and approximating geometry by one plane for the whole scene, for one camera triple, or by several planes for one camera triple.

In between, the weights are interpolated linearly, similar to Gouraud shading. Within the triangle the sum of the weights is 1 at each point. The total image is built up as a mosaic of these triangles.
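A possible CPU-side sketch of this blending scheme is given below. It is an illustration rather than the implementation described here: the homographies H_to_virtual to the virtual view and the projected camera centers are assumed to be precomputed, and SciPy's Delaunay triangulation together with OpenCV's warpPerspective stand in for the graphics-hardware pipeline discussed later.

    import numpy as np
    import cv2
    from scipy.spatial import Delaunay

    def render_fixed_plane(images, H_to_virtual, cam_centers_2d, out_shape):
        """cam_centers_2d[i]: (x, y) projection of camera i's center in the virtual image."""
        H, W = out_shape
        out = np.zeros((H, W, 3), np.float32)
        flat = out.reshape(-1, 3)
        tri = Delaunay(np.asarray(cam_centers_2d, dtype=float))
        warped = [cv2.warpPerspective(img, Hv, (W, H))
                  for img, Hv in zip(images, H_to_virtual)]
        ys, xs = np.mgrid[0:H, 0:W]
        pix = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(float)
        simplex = tri.find_simplex(pix)            # -1 outside the camera mesh (left black)
        for s in range(len(tri.simplices)):
            mask = simplex == s
            if not np.any(mask):
                continue
            # barycentric coordinates = blending weights (1 at "its" vertex, 0 at the others)
            T = tri.transform[s]
            b2 = (pix[mask] - T[2]) @ T[:2].T
            bary = np.column_stack([b2, 1.0 - b2.sum(axis=1)])
            for k, cam in enumerate(tri.simplices[s]):
                flat[mask] += bary[:, k:k + 1] * warped[cam].reshape(-1, 3)[mask]
        return np.clip(out, 0, 255).astype(np.uint8)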

View-Dependent Geometry Approximation. The results can be improved if depth maps are available. A depth map provides the distance between the projection center and the scene for every pixel. To avoid a runtime penalty, a diffusion algorithm is used to fill in undefined depth values. From the depth maps we can calculate the approximating plane of geometry for each triangle of the actual view. This is simple because the points in the original images corresponding to the triangle vertices are easily computed (i.e. the projection of the virtual camera center in the original views). The 3D point M_i which corresponds to view i can be calculated as

M_i = D_i(P_i C_V) (C_i − C_V) / ‖C_i − C_V‖ + C_i     (29)

where D_i(·) is a function representing the depth map (if the virtual camera moves in front of the real cameras the sign of the first term has to be reversed). We can interpret the points M_i as the intersection of the line C_V C_i with the scene geometry. Knowing the 3D coordinates of the triangle corners, we can define a plane through them and apply the same rendering technique as described above.
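A direct transcription of Eq. (29) might look as follows; the helper names are illustrative and not taken from the paper.

    import numpy as np

    def plane_support_point(C_i, C_V, P_i, D_i):
        """C_i, C_V: real and virtual camera centers (3,); P_i: 3x4 projection matrix of
        view i; D_i(u, v): depth stored in view i's (diffusion-filled) depth map."""
        x = P_i @ np.append(C_V, 1.0)       # project the virtual center into view i
        u, v = x[:2] / x[2]
        direction = (C_i - C_V) / np.linalg.norm(C_i - C_V)
        # Eq. (29); the sign of the first term flips when the virtual camera
        # moves in front of the real cameras.
        return D_i(u, v) * direction + C_i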

Finally, if inside the triangle the scene is not well approximated by a plane, the triangle can be subdivided into four sub-triangles by splitting each of the three sides into two parts. For each of these sub-triangles, a separate approximating plane is calculated in the above manner. Note that the computation of the depth becomes slightly more complicated because for points other than the epipoles the correspondence between view i and V depends on the depth, which therefore has to be computed iteratively. Further subdivision can be done in the same manner to improve accuracy. In fact, using the same approach extrapolation also becomes feasible. Conceptually it is sufficient to add the four image corners to the points for which the Delaunay triangulation is computed and then use subdivision to further refine the approximation. In practice, however, results degrade rapidly as one moves away from the original viewpoints. Adding the image corners to the triangulation can also be used to solve the degeneracy that occurs when the virtual camera passes through the mesh formed by the real camera positions. This extrapolation approach was proposed by Buehler et al. (2001).

Hardware Accelerated Rendering. We have implemented our approach so that it takes advantage of the available graphics hardware. In doing so real-time performance is easily achieved. Triangulation and depth look-up are performed on the CPU, while the GPU's projective texture mapping and alpha blending functionality are used to build up the image. The level of subdivision can be fixed, or subdivision can be performed adaptively depending on the observed depth variation within a triangle.

In this section, we have shown how the proposed approach for modeling from images could easily be extended to allow the rendering of novel views using a plenoptic or view-dependent texture/geometry representation. The quality of rendered images can be varied by adjusting the resolution of the considered scene geometry. This approach has been described in Koch et al. (1999a, 1999b) and Heigl et al. (1999).

Example. We have tested our approaches with an image sequence of 187 images showing an office scene. Figure 11 (top-left) shows one particular image. A digital consumer video camera was swept freely over a cluttered scene on a desk, covering a viewing surface of about 1 m². Figure 11 (top-right) shows the calibration result. A result of a rendered view is shown in the middle-left part of the figure. The middle-right part illustrates the success of the extended structure-and-motion algorithm. Features that are lost are picked up again when they reappear in the images. In the lower part of Fig. 11 a detail of a view is shown for the different methods. In the case of one global plane (left image), the reconstruction is sharp where the approximating plane intersects the actual scene geometry. The reconstruction is blurred where the scene geometry diverges from this plane. In the case of local planes (middle image), at the corners of the triangles, the reconstruction is almost sharp, because there the scene geometry is considered directly. Within a triangle, ghosting artifacts occur where the scene geometry diverges from the particular local plane. If these triangles are subdivided (right image) these artifacts are reduced further.

5.3. Combining Real and Virtual Scenes

Another interesting possibility offered by the presented approach is to combine real and virtual scene elements. This allows us to augment real environments with virtual objects. A first approach consists of virtualizing the real environment and then placing virtual objects in it. This can readily be done using the techniques presented in Section 5.1. An example is shown in Fig. 12. The landscape of Sagalassos (an archaeological site in Turkey) was modeled from a dozen photographs taken from a nearby hill. Virtual reconstructions of ancient monuments have been made based on measurements and hypotheses of archaeologists. Both could then be combined in a single virtual world.

Augmenting Video Footage. Another challenging application consists of seamlessly merging virtual objects with real video. In this case the ultimate goal is to make it impossible to differentiate between real and virtual objects. Several problems need to be overcome before achieving this goal. Among them are the rigid registration of virtual objects into the real environment, the problem of mutual occlusion of real and virtual objects, and the extraction of the illumination distribution of the real environment in order to render the virtual objects with this illumination model.

Here we will concentrate on the first of these problems, although the computations described in the previous section also provide most of the necessary information to solve for occlusions and other interactions between the real and virtual components of the augmented scene.



Figure 11. Top: Image of the desk sequence and sparse structure-and-motion result (left) and view rendered using adaptive subdivision (right). Middle: Details of rendered images showing multiple levels of geometric refinement. Bottom: Rendering of a view-dependent effect.

Accurate registration of virtual objects into a real environment is still a challenging problem. Systems that fail to do so will fail to give the user a real-life impression of the augmented outcome.

Since our approach does not use markers or a priori knowledge of the scene or the camera, we can deal with video footage of unprepared environments or archive video footage. More details on this approach can be found in Cornelis et al. (2001).



Figure 12. Virtualized landscape of Sagalassos combined with virtual reconstructions of monuments.

The software packages Boujou and MatchMover, commercialized by 2D3 and RealViz respectively (and inspired by work at Oxford and INRIA respectively), follow a similar approach.

An important difference with the applications discussed in the previous sections is that in this case all frames of the input video sequence have to be processed, while for 3D modeling often a sparse set of views is sufficient. Therefore, in this case features should be tracked from frame to frame. As already mentioned in Section 3.1 it is important that the structure is initialized from frames that are sufficiently separated. Another key component is the bundle adjustment. It not only reduces the frame-to-frame jitter, but also removes the largest part of the error that the structure and motion approach accumulates over the sequence. According to our experience it is very important to extend the perspective camera model with at least one parameter for radial distortion to obtain an undistorted metric structure (this will clearly be demonstrated in the example). Undistorted models are required to position larger virtual entities correctly in the model and to avoid drift of virtual objects in the augmented video sequences. Note however that for the rendering of the virtual objects the computed radial distortion can most often be ignored (except for sequences where radial distortion is immediately noticeable from single images).
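As an illustration of what such an extension typically looks like, the fragment below uses the common one-parameter polynomial model; the exact parameterization is an assumption of this sketch rather than a detail quoted from the paper.

    import numpy as np

    def project_with_radial_distortion(K, R, t, k1, X):
        """K: 3x3 intrinsics, (R, t): camera pose, k1: radial coefficient, X: Nx3 points."""
        Xc = X @ R.T + t                      # world -> camera coordinates
        x = Xc[:, :2] / Xc[:, 2:3]            # normalized image coordinates
        r2 = np.sum(x ** 2, axis=1, keepdims=True)
        xd = x * (1.0 + k1 * r2)              # one-parameter radial distortion
        uv = np.column_stack([xd, np.ones(len(xd))]) @ K.T
        return uv[:, :2]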

Examples. This example was recorded at Sagalassos in Turkey, where the footage of the ruins of an ancient fountain was taken. The fountain video sequence consists of 250 frames. A large part of the original monument is missing. Based on results of archaeological excavations and architectural studies, it was possible to generate a virtual copy of the missing part. Using the proposed approach the virtual reconstruction could be placed back on the remains of the original monument, at least in the recorded video sequence. This material is of great interest to the archaeologists, not only for education and dissemination, but also for fund-raising to achieve a real restoration of the fountain. The top part of Fig. 13 shows a top view of the recovered structure before and after bundle adjustment. Besides the larger reconstruction error it can also be noticed that the non-refined structure is slightly bent. This effect mostly comes from not taking the radial distortion into account in the initial structure recovery.



Figure 13. Fusion of real and virtual fountain parts. Top: Top view of the recovered structure-and-motion with and without taking radial distortion into account in the bundle adjustment. Bottom: 6 of the 250 frames of the fused video sequence.



Therefore, a bundle adjustment that did not model radial distortion would not yield satisfying results. In the rest of Fig. 13 some frames of the augmented video are shown.

6. Conclusion

In this paper a complete system for visual modeling with a hand-held camera was presented. The system combines different components that gradually retrieve all the information that is necessary to construct visual models from images. Automatically extracted features are tracked or matched between consecutive views and multi-view relations are robustly computed. Based on this the projective structure and motion is determined and subsequently upgraded to metric through self-calibration. Bundle adjustment is used to refine the results. Then, image pairs are rectified and matched using a stereo algorithm and dense and accurate depth maps are obtained by combining measurements of multiple pairs. For a five-image sequence of video resolution, the computation of the metric structure and motion, including bundle adjustment and self-calibration, takes about 20 seconds. Computing pairwise stereo takes less than a minute per pair and performing multi-view linking for the complete sequence takes another 2 minutes, so that the complete automated processing of such a sequence requires 6–7 minutes on a standard PC.

From the computed results different types of visual models can be constructed. First the traditional approach that consists of constructing a textured 3D mesh was presented, then an image-based approach was described and extended to yield both view-dependent geometry and texture. This last approach makes it possible to efficiently capture visually complex scenes. It was also shown that the proposed approach could be used to combine real and virtual scenes in video sequences.

Acknowledgment

We acknowledge the financial support of the FWO project G.0223.01, the European IST projects Vibes, Murale and InView, as well as of the NSF IIS 0237533 grant. Part of this research was carried out in collaboration with Benno Heigl and colleagues from the University of Erlangen-Nürnberg. The GPU-based implementation of our combined image- and geometry-based rendering was done by Alexander Thomas. We are also grateful to Marc Waelkens and his colleagues for their assistance and for allowing us to record data on the archaeological site of Sagalassos.

Note

1. In practice, to avoid selecting too many key-frames, we propose to pick a key-frame at the last frame for which n_i ≥ 0.9 n_e, with n_i the number of valid tracks and n_e the number of valid tracks when the epipolar geometry model overtakes the homography model.

References

Beardsley, P., Zisserman, A., and Murray, D. 1997. Sequential updating of projective and affine structure from motion. International Journal of Computer Vision, 23(3):235–259.

Buehler, C., Bosse, M., McMillan, L., Gortler, S., and Cohen, M. 2001. Unstructured lumigraph rendering. In Proceedings ACM SIGGRAPH 2001, pp. 425–432.

Chai, J.-X., Tong, X., Chan, S.-C., and Shum, H.-Y. 2000. Plenoptic sampling. In Proc. SIGGRAPH, pp. 307–318.

Chum, O. and Matas, J. 2002. Randomized RANSAC with Td,d test. In Proceedings of the British Machine Vision Conference, P. Rosin and D. Marshall (Eds.), BMVA: London, UK, vol. 2, pp. 448–457.

Cornelis, K., Pollefeys, M., Vergauwen, M., and Van Gool, L. 2001. Augmented reality from uncalibrated video sequences. In 3D Structure from Images—SMILE 2000, M. Pollefeys, L. Van Gool, A. Zisserman, and A. Fitzgibbon (Eds.), Lecture Notes in Computer Science, Springer-Verlag, vol. 2018, pp. 150–167.

Cox, I., Hingorani, S., and Rao, S. 1996. A maximum likelihood stereo algorithm. Computer Vision and Image Understanding, 63(3).

Curless, B. and Levoy, M. 1996. A volumetric method for building complex models from range images. In Proc. SIGGRAPH '96, pp. 303–312.

Debevec, P., Borshukov, G., and Yu, Y. 1998. Efficient view-dependent image-based rendering with projective texture-mapping. In 9th Eurographics Rendering Workshop, Vienna, Austria.

Debevec, P., Taylor, C., and Malik, J. 1996. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In Proc. SIGGRAPH '96, pp. 11–20.

Falkenhagen, L. 1997. Hierarchical block-based disparity estimation considering neighbourhood constraints. In Proceedings International Workshop on SNHC and 3D Imaging, Rhodes, Greece, pp. 115–122.

Faugeras, O. 1992. What can be seen in three dimensions with an uncalibrated stereo rig. In Computer Vision—ECCV'92, Lecture Notes in Computer Science, Springer-Verlag, vol. 588, pp. 563–578.

Faugeras, O., Luong, Q.-T., and Maybank, S. 1992. Camera self-calibration: Theory and experiments. In Computer Vision—ECCV'92, Lecture Notes in Computer Science, Springer-Verlag, vol. 588, pp. 321–334.

Faugeras, O., Luong, Q.-T., and Papadopoulo, T. 2001. The Geometry of Multiple Images. MIT Press.

Fischler, M. and Bolles, R. 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. Assoc. Comp. Mach., 24:381–395.

Fitzgibbon, A. and Zisserman, A. 1998. Automatic camera recovery for closed or open image sequences. In Computer Vision—ECCV'98, Lecture Notes in Computer Science, vol. 1406, Springer-Verlag, vol. 1, pp. 311–326.

Gortler, S., Grzeszczuk, R., Szeliski, R., and Cohen, M.F. 1996. The Lumigraph. In Proc. SIGGRAPH '96, ACM Press: New York, pp. 43–54.

Harris, C. and Stephens, M. 1988. A combined corner and edge detector. In Fourth Alvey Vision Conference, pp. 147–151.

Hartley, R. 1994. Euclidean reconstruction from uncalibrated views. In Applications of Invariance in Computer Vision, J.L. Mundy, A. Zisserman, and D. Forsyth (Eds.), Lecture Notes in Computer Science, Springer-Verlag, vol. 825, pp. 237–256.

Hartley, R. 1997. In defense of the eight-point algorithm. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(6):580–593.

Hartley, R., Gupta, R., and Chang, T. 1992. Stereo from uncalibrated cameras. In Proc. Conference Computer Vision and Pattern Recognition, pp. 761–764.

Hartley, R. 1998. Chirality. International Journal of Computer Vision, 26(1):41–61.

Hartley, R. and Sturm, P. 1997. Triangulation. Computer Vision and Image Understanding, 68(2):146–157.

Hartley, R. and Zisserman, A. 2000. Multiple View Geometry in Computer Vision. Cambridge University Press.

Heigl, B., Koch, R., Pollefeys, M., Denzler, J., and Van Gool, L. 1999. Plenoptic modeling and rendering from image sequences taken by hand-held camera. In Proc. DAGM'99, pp. 94–101.

Heyden, A. and Astrom, K. 1996. Euclidean reconstruction from constant intrinsic parameters. In Proc. 13th International Conference on Pattern Recognition, IEEE Computer Soc. Press, pp. 339–343.

Heyden, A. and Astrom, K. 1997. Euclidean reconstruction from image sequences with varying and unknown focal length and principal point. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Soc. Press, pp. 438–443.

Kahl, F. 1999. Critical motions and ambiguous Euclidean reconstructions in auto-calibration. In Proc. ICCV, pp. 469–475.

Koch, R. 1996. Automatische Oberflächenmodellierung starrer dreidimensionaler Objekte aus stereoskopischen Rundum-Ansichten. PhD thesis, University of Hannover, Germany; also published as Fortschritte-Berichte VDI, Reihe 10, No. 499, VDI Verlag, 1997.

Koch, R., Heigl, B., Pollefeys, M., Van Gool, L., and Niemann, H. 1999a. A geometric approach to lightfield calibration. In Proc. CAIP99, Springer-Verlag, vol. LNCS 1689, pp. 596–603.

Koch, R., Pollefeys, M., and Van Gool, L. 1998. Multi viewpoint stereo from uncalibrated video sequences. In Proc. European Conference on Computer Vision, Freiburg, Germany, pp. 55–71.

Koch, R., Pollefeys, M., Heigl, B., Van Gool, L., and Niemann, H. 1999b. Calibration of hand-held camera sequences for plenoptic modeling. In Proc. ICCV'99 (International Conference on Computer Vision), Corfu (Greece), pp. 585–591.

Kutulakos, K.N. and Seitz, S.M. 2000. A theory of shape by space carving. International Journal of Computer Vision, 38(3):199–218.

Laveau, S. and Faugeras, O. 1996. Oriented projective geometry for computer vision. In Computer Vision—ECCV'96, B. Buxton and R. Cipolla (Eds.), Lecture Notes in Computer Science, Springer-Verlag, vol. 1064, pp. 147–156.

Levoy, M. and Hanrahan, P. 1996. Light field rendering. In Proc. SIGGRAPH '96, ACM Press: New York, pp. 31–42.

Lorensen, W. and Cline, H. 1987. Marching cubes: A high resolution 3D surface construction algorithm. Computer Graphics (Proceedings of SIGGRAPH 87), 21(4):163–169.

Lowe, D. 1999. Object recognition from local scale-invariant features. In Proc. International Conference on Computer Vision, pp. 1150–1157.

Matei, B. and Meer, P. 2000. A general method for errors-in-variables problems in computer vision. In Proc. CVPR 2000, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Hilton Head Island, South Carolina), IEEE Computer Society Press: Los Alamitos, CA, vol. 2, pp. 18–25.

McMillan, L. and Bishop, G. 1995. Plenoptic modeling: An image-based rendering system. In Proc. SIGGRAPH '95, pp. 39–46.

Ofek, E., Shilat, E., Rappoport, A., and Werman, M. 1997. Highlight and reflection independent multiresolution textures from image sequences. IEEE Computer Graphics and Applications, 17(2).

Okutomi, M. and Kanade, T. 1993. A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(4):353–363.

Pollefeys, M. 1999. Self-calibration and metric 3D reconstruction from uncalibrated image sequences. Ph.D. dissertation, ESAT-PSI, K.U. Leuven.

Pollefeys, M., Koch, R., and Van Gool, L. 1998. Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. In Proc. International Conference on Computer Vision, Narosa Publishing House, pp. 90–95.

Pollefeys, M., Koch, R., and Van Gool, L. 1999. A simple and efficient rectification method for general motion. In Proc. ICCV'99 (International Conference on Computer Vision), Corfu (Greece), pp. 496–501.

Pollefeys, M., Koch, R., and Van Gool, L. 1999. Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. International Journal of Computer Vision, 32(1):7–25.

Pollefeys, M. and Van Gool, L. 1999. Stratified self-calibration with the modulus constraint. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8):707–724.

Pollefeys, M., Verbiest, F., and Van Gool, L. 2002. Surviving dominant planes in uncalibrated structure and motion recovery. In Computer Vision—ECCV 2002, 7th European Conference on Computer Vision, A. Heyden, G. Sparr, M. Nielsen, and P. Johansen (Eds.), Lecture Notes in Computer Science, vol. 2351, pp. 837–851.

Press, W., Teukolsky, S., and Vetterling, W. 1992. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press.

Rousseeuw, P. 1987. Robust Regression and Outlier Detection. Wiley: New York.

Sawhney, H., Hsu, S., and Kumar, R. 1998. Robust video mosaicing through topology inference and local to global alignment. In Computer Vision—ECCV'98: Proc. 5th European Conference on Computer Vision, Lecture Notes in Computer Science, Springer-Verlag, vol. II, pp. 103–119.

Scharstein, D. and Szeliski, R. 2002. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1/2/3):7–42.

Schmid, C. and Mohr, R. 1997. Local grayvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5):530–534.


Schroeder, W., Zarge, J., and Lorensen, W. 1992. Decimation of triangle meshes. Computer Graphics (Proceedings of SIGGRAPH 92), 26(2):65–70.

Shi, J. and Tomasi, C. 1994. Good features to track. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR'94), pp. 593–600.

Slama, C. 1980. Manual of Photogrammetry, 4th edition. American Society of Photogrammetry: Falls Church, VA, USA.

Soucy, M. and Laurendeau, D. 1995. A general surface approach to the integration of a set of range views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(4):344–358.

Sturm, P. 1997a. Critical motion sequences for monocular self-calibration and uncalibrated Euclidean reconstruction. In Proc. 1997 Conference on Computer Vision and Pattern Recognition, IEEE Computer Soc. Press, pp. 1100–1105.

Sturm, P. 1997b. Vision 3D non calibrée: contributions à la reconstruction projective et étude des mouvements critiques pour l'auto-calibrage. Ph.D. Thesis, Institut National Polytechnique de Grenoble.

Sturm, P. 1999. Critical motion sequences for the self-calibration of cameras and stereo systems with variable focal length. In Proceedings of the Tenth British Machine Vision Conference, T. Pridmore and D. Elliman (Eds.), Nottingham, England, British Machine Vision Association, pp. 63–72.

Tomasi, C. and Kanade, T. 1992. Shape and motion from image streams under orthography: A factorization approach. International Journal of Computer Vision, 9(2):137–154.

Torr, P. 1995. Motion Segmentation and Outlier Detection. PhD Thesis, Dept. of Engineering Science, University of Oxford.

Torr, P., Fitzgibbon, A., and Zisserman, A. 1998. Maintaining multiple motion model hypotheses through many views to recover matching and structure. In Proc. ICCV, pp. 485–491.

Triggs, B. 1997. The absolute quadric. In Proc. 1997 Conference on Computer Vision and Pattern Recognition, IEEE Computer Soc. Press, pp. 609–614.

Triggs, B., McLauchlan, P., Hartley, R., and Fitzgibbon, A. 2000. Bundle adjustment—A modern synthesis. In Vision Algorithms: Theory and Practice, B. Triggs, A. Zisserman, and R. Szeliski (Eds.), Springer-Verlag, vol. LNCS 1883, pp. 298–372.

Turk, G. and Levoy, M. 1994. Zippered polygon meshes from range images. In Proceedings of SIGGRAPH '94, pp. 311–318.

Tuytelaars, T. and Van Gool, L. 2000. Wide baseline stereo based on local, affinely invariant regions. In British Machine Vision Conference, pp. 412–422.

Van Meerbergen, G., Vergauwen, M., Pollefeys, M., and Van Gool, L. 2002. A hierarchical symmetric stereo algorithm using dynamic programming. International Journal of Computer Vision, 47(1–3):275–285.

Wheeler, M., Sato, Y., and Ikeuchi, K. 1998. Consensus surfaces for modeling 3D objects from multiple range images. In Sixth International Conference on Computer Vision, pp. 917–924.

Willson, R. 1994. Modeling and Calibration of Automated Zoom Lenses. Ph.D. thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Zhang, Z., Deriche, R., Faugeras, O., and Luong, Q.-T. 1995. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence Journal, 78:87–119.