


Robust Multi-View Camera Calibration for Wide-Baseline Camera Networks

Jens Puwein, ETH Zurich
[email protected]

Remo Ziegler, Julia Vogel, LiberoVision
http://www.liberovision.com

Marc Pollefeys, ETH Zurich
[email protected]

Abstract

Real-world camera networks are often characterized by very wide baselines covering a wide range of viewpoints. We describe a method that not only calibrates each camera sequence added to the system automatically, but also takes advantage of multi-view correspondences to make the entire calibration framework more robust. Novel camera sequences can be seamlessly integrated into the system at any time, adding to the robustness of future computations.

One of the challenges consists in establishing correspondences between cameras. Starting from a bag of features initialized from a calibrated frame, correspondences between cameras are established in a two-step procedure. First, affine invariant features of camera sequences are warped into a common coordinate frame and a coarse matching is obtained between the collected features and the incrementally built and updated bag of features. This allows us to warp images to a common view. Second, scale invariant features are extracted from the warped images. This leads to both more numerous and more accurate correspondences. Finally, the parameters are optimized in a bundle adjustment. Adding the feature descriptors and the optimized 3D positions to the bag of features, we obtain a feature-based scene abstraction, allowing for the calibration of novel sequences and the correction of drift in single-view calibration tracking. We demonstrate that our approach can deal with wide baselines and that novel sequences can seamlessly be integrated into the calibration framework.

1. Introduction and Related Work

Motivation Camera networks are prevalent in many areas like surveillance applications and sports broadcasts. Multiple calibrated cameras capturing the same scene can be harnessed to extract 3D data from the 2D images. This data can be measurements like the height, the 3D position or the movement of an object. It may even be used to synthesize novel views or to add virtual drawings to the scene, e.g. to enhance the experience of sports broadcasts [11, 12, 7].

Figure 1. Starting from a calibrated frame, the proposed method calibrates a network of cameras. The images shown were captured by cameras separated by a very wide baseline. The successful calibration is visualized by projecting a wireframe model of the playing field.

The cameras capturing the scene can range from pan-tilt-zoom cameras to rail and spider cameras (i.e. moving cameras). To capture the scene from different viewpoints and to recover as much information as possible, cameras are often separated by very wide baselines. While this provides the user with extensive coverage, the feature matching between different cameras, beneficial or even necessary for camera calibration and 3D reconstruction, becomes very difficult.

Related Work Several techniques for feature extraction and description have been proposed over the past few years [8, 10, 18]. The degree of invariance towards changes in viewpoint differs and comes at the cost of distinctiveness. When captured by a camera, planar surface patches undergo perspective transformations. The synthesis of novel views and viewpoint normalization can reverse this process. Consequently, descriptors extracted from such views need not be invariant towards affine or projective transformations and can therefore be more distinctive. Normalized viewpoints have been used successfully to improve the matching of image regions [6, 19]. We use the idea of normalized patches to extract matches on dominant scene planes.

If it is known that cameras are capturing the same scene for a long period of time, it makes sense to extract an abstract representation or to build a model of the scene being captured. The extraction of features from images to generate abstract representations of the real world is commonly used in location recognition and visual localization. Feature descriptors are not only associated with an image and 2D geometry information like scale and rotation, but they are also linked to 3D geometry. Descriptors extracted from a new image are matched to a database containing feature descriptors and 3D information. The 3D information can be used for pose estimation and efficient matching [19, 2].

To the best of our knowledge, no multi-view approaches for camera calibration have been presented in sports-specific research. Large parts of the scenes are dynamic, cameras are separated by wide baselines and features are often occluded. This makes it hard to extract multi-view correspondences. Existing approaches deal with each camera separately, often relying on the known appearance and geometry of the playing field [16, 3]. There, lines are used to initialize and track the calibration or to avoid drift when using optical flow based trackers. While these approaches proved to be rather successful for single-view usage like augmented reality on the playing field, they lack the multi-view constraints which reduce the relative error between cameras. This is important for 3D reconstruction or free-viewpoint rendering.

Contributions In this paper we present a framework to calibrate camera sequences separated by wide baselines, where additional sequences can be integrated into the system at any time, adding to its stability. Cameras are not required to capture the scene at the same time and, in particular, they do not have to be synchronized. By providing the calibration of a single frame, a method to extract and update a feature-based representation of the scene being captured is bootstrapped, taking advantage of different kinds of features.

In the feature matching stage, the large tolerance which MSER [10] features provide towards changes in viewpoint is leveraged and combined with SIFT [8] in a two-step matching procedure. The special structure of homographies mapping scene planes to images leads to a very efficient 2-point RANSAC when matching MSER features. To increase the robustness of the matching stage, features extracted from several frames are transformed and collected before matching. A feature-based representation of the scene is created and updated using a very simple yet effective strategy based on the visibility and the uniqueness of features.

Figure 2. MSER and SIFT features are stored in a regularly updated bag of features representing the playing field. Red dots indicate positions on the playing field associated with the features.

This paper is structured as follows. An overview of the proposed system is given in Section 2, followed by a formal description of the technical part in Section 3. The setup for our experiments and the results can be found in Section 4. Finally, we conclude the paper in Section 5.

2. System Overview

Usually the raw material of sports broadcasts consists of the video streams provided by a variety of cameras, including fixed cameras, pan-tilt-zoom cameras, rail cameras, spider cameras, cameras mounted on large robot arms and cameras carried by humans. All of them except for the completely fixed cameras change zoom and orientation, and some of them also change pose. In the following we work with sequences of video streams provided by such cameras. Our goal is to determine the 6 degrees of freedom of the camera extrinsics as well as the focal lengths, both for every frame. Radial distortion can be estimated in the bundle adjustment step [17].

An overview of the presented system is depicted in Figure 3. It is initialized with a single calibrated frame (e.g. Figure 6(a)), which determines absolute scale, rotation and translation of the scene. The initial calibration is propagated to a few neighboring frames using a KLT feature tracker [9, 13]. Given the resulting coarse calibration for those neighboring frames, MSER features are extracted and warped onto the playing field, i.e. the ellipses characterizing the MSER features are backprojected onto the playing field. The warped features initialize a bag of features (Figure 2), providing an abstract representation of the playing field (Section 3.1). This concludes the initialization stage (Figure 3(a)).

Whenever a new sequence is added to the system in the main loop (Figure 3(b)), the input images are registered using homographies. MSER features are extracted in all images and collected in a reference image. This feature collection is matched with the bag of features more robustly than features extracted from a single frame only. The affine transformations extracted from MSER feature matches lead to a very efficient 2-point RANSAC (Section 3.2).

Figure 3. (a) The initialization of the bag of features. (b) The main loop.

From the resulting MSER matches, coarse initial calibrations are obtained, which allows us to warp input images into fronto-parallel views before SIFT features are extracted and matched. In a subsequent bundle adjustment the camera parameters and the positions of the features on the playing field are refined (Section 3.3). The bag of features is updated by warping and adding the new MSER features and SIFT features along with their optimized 3D positions. In order not to grow the bag of features continuously and in order to keep it discriminative, a simple updating scheme based on visibility, scale and viewpoint criteria is applied (Section 3.4).

3. Technical Part

3.1. Initializing the System

The whole system is initialized with a calibrated frame. There are several ways to obtain such a frame. While automatic line based methods exist [16], it is difficult to reliably detect field lines in sports like ice hockey and basketball. Therefore, the user is required to hand-click a few predefined points on the playing field (e.g. corners of the playing field) to make sure that the initial calibration is correct. Due to the standardized geometry the absolute scale of the scene is known. The 3D coordinate frame is chosen such that the playing field coincides with the xy-plane. Starting from this calibrated frame, the calibration is propagated to a few neighboring frames using a GPU based KLT feature tracker. Since the calibration and the geometry of the stadium are known, only features lying on the playing field are tracked. Additionally, a simple chromakeying based on a Gaussian mixture model of the colors on the playing field followed by some morphological operations allows us to get rid of most of the feature points detected on players, referees and balls. Feature tracks are pruned in a RANSAC step by making sure they can be explained by a homography.
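To make the chromakeying and track-pruning steps concrete, the following is a minimal sketch, assuming OpenCV and a Gaussian mixture model `gmm` already fitted to field color samples (e.g. with scikit-learn); the function names and all threshold values are illustrative assumptions, not values from the paper.

```python
import cv2
import numpy as np

def field_mask(image_bgr, gmm, log_lik_thresh=-10.0):
    """Chromakeying: keep pixels whose color is likely under the field model.
    `gmm` is assumed to be e.g. sklearn.mixture.GaussianMixture fitted on
    field color samples from the calibrated frame."""
    h, w = image_bgr.shape[:2]
    scores = gmm.score_samples(image_bgr.reshape(-1, 3).astype(np.float64))
    mask = (scores.reshape(h, w) > log_lik_thresh).astype(np.uint8) * 255
    # Morphological opening/closing suppresses small blobs such as players.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 9))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

def prune_tracks(pts_prev, pts_next, ransac_thresh=3.0):
    """Keep only KLT tracks that a single homography can explain."""
    H, inliers = cv2.findHomography(pts_prev, pts_next,
                                    cv2.RANSAC, ransac_thresh)
    keep = inliers.ravel().astype(bool)
    return H, pts_prev[keep], pts_next[keep]
```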

Once all frames of the initial sequence are calibrated, MSER features are extracted for several keyframes, again ignoring parts of the image not corresponding to the playing field. Every 15th frame of the sequence is chosen to be a keyframe. Applying the affine invariant MSER feature detector results in a number of interest points and ellipses [10]. Warping an image patch centered at the interest point such that the ellipse associated with it becomes a unit circle renders the image patch invariant to affine transformations, except for rotation. Using the SIFT descriptor, the ambiguity in rotation is removed by aligning the image patch according to its dominant orientation [8]. The SIFT descriptor extracted from the normalized image patch is used for matching the MSER feature with other MSER features. Combining the normalizing affine transformations from two corresponding MSER features leads to an affine transformation between them, i.e. an affine correspondence.
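A minimal sketch of this normalization, assuming the MSER ellipse is given as a 2x2 positive definite matrix C (boundary points x around `center` satisfy x^T C^{-1} x = 1); the patch size and this parameterization are assumptions for illustration.

```python
import cv2
import numpy as np

def normalize_mser_patch(image, center, C, patch_size=41):
    """Warp the region around `center` so the MSER ellipse becomes a circle.
    The remaining rotation ambiguity would still be removed afterwards by
    aligning the patch to its dominant SIFT orientation."""
    A = np.linalg.cholesky(C)          # maps the unit circle onto the ellipse
    s = patch_size / 2.0               # unit circle -> half the patch width
    M = s * np.linalg.inv(A)           # image offsets -> patch coordinates
    t = np.full(2, patch_size / 2.0) - M @ np.asarray(center, dtype=float)
    warp = np.hstack([M, t[:, None]])  # 2x3 forward affine for warpAffine
    return cv2.warpAffine(image, warp, (patch_size, patch_size))
```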

Since the camera calibration of the initial sequence is known, the MSER features extracted from novel query images can be matched directly with the playing field instead of the calibrated images. For that purpose the affine transformations extracted from the calibrated frames need to be updated accordingly. Each ellipse in an image corresponds to an ellipse on the playing field. In order to be able to extract affine correspondences between the playing field and a query image, the transformations mapping the ellipses in the calibrated images to oriented unit circles need to be preceded by the transformations mapping the ellipses on the playing field to the ellipses in the images. These local mappings of image patches on the playing field to the corresponding patches in the images are given by the Jacobians of the homographies mapping the playing field to the images.
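For illustration, the Jacobian of a homography H (playing field to image) at a field point (x, y) follows from differentiating the mapping given in Equation 6 of the appendix; the sketch below is the standard derivative, not code from the paper.

```python
import numpy as np

def homography_jacobian(H, x, y):
    """2x2 local affine approximation of the homography H around (x, y)."""
    w = H[2, 0] * x + H[2, 1] * y + H[2, 2]
    u = (H[0, 0] * x + H[0, 1] * y + H[0, 2]) / w
    v = (H[1, 0] * x + H[1, 1] * y + H[1, 2]) / w
    return np.array([[H[0, 0] - u * H[2, 0], H[0, 1] - u * H[2, 1]],
                     [H[1, 0] - v * H[2, 0], H[1, 1] - v * H[2, 1]]]) / w
```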

Due to the spatial coherence within successive frames, some features are extracted several times. All MSER features together with their affine transformations and 3D positions are used to initialize the bag of features. While duplicates can be detected and eliminated by matching features across frames, the task of handling multiple occurrences and features not useful for matching is left to the bag of features, as explained in Section 3.4.

3.2. Adding Additional Sequences

Once a novel sequence is added to the system, MSER features are extracted exactly like in the initial step. Since the camera parameters are unknown, the audience cannot be eliminated as before. However, assuming the camera shows the playing field to a large extent, the chromakeying, the KLT feature tracker and the RANSAC step can still be used to get homographies relating the image parts showing the playing field in neighboring frames. By accumulating the homographies, a mapping between any two images of the sequence is provided. In particular, this leads to a mapping from any image to the first image of the sequence. This means that all MSER features can be warped into the coordinate frame of the first image. Again, apart from the feature positions, the affine transformations normalizing the image patches need to be adjusted as well. This is analogous to Section 3.1. Instead of using the mapping of the playing field to the image, the homography between the two images is used.
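A sketch of the accumulation, assuming `frame_homographies[i]` maps frame i into frame i+1 (this representation is an assumption):

```python
import numpy as np

def to_first_frame(frame_homographies, k):
    """Compose frame-to-frame homographies to map frame k into frame 0."""
    H = np.eye(3)
    for Hi in frame_homographies[:k]:
        H = Hi @ H                 # after the loop, H maps frame 0 -> frame k
    return np.linalg.inv(H)        # invert to map frame k -> frame 0
```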

To obtain a coarse calibration for the new sequence, the collected features are matched with the bag of features. A match between MSER features $f_i$ and $f_j$ is a valid candidate match if and only if $f_i$ is the most similar feature to $f_j$ and vice versa. The resulting matches usually contain many outliers. The additional information provided by the ellipses obtained from MSER detections provides us with the means to apply an efficient 2-point RANSAC to extract the homography mapping the playing field to the image [1]. We pick up ideas from Koeser et al. to calculate a homography from two affine correspondences [5]. The key observation is that, locally, affine feature correspondences provide a first order Taylor approximation to homographies. While Koeser et al. present the case of conjugate rotations, a different method is necessary to deal with unknown poses and unknown focal lengths. The homography $H$ describing the mapping from the playing field to the image has the following structure:

$$H = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix} = \lambda K \begin{pmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} r_1 & r_2 & t \end{pmatrix}. \quad (1)$$

$K$ contains known intrinsic camera parameters and $f$ denotes the unknown focal length. $r_1$ and $r_2$ denote the first and the second column of the camera rotation matrix, respectively, and $t$ is a translation vector. Since $r_1$ and $r_2$ are columns of a rotation matrix, the following constraint on the entries of $H$ has to hold:

$$0 = (h_{11}h_{12} + h_{21}h_{22})(h_{32}^2 - h_{31}^2) + h_{31}h_{32}(h_{11}^2 + h_{21}^2 - h_{12}^2 - h_{22}^2). \quad (2)$$

One affine correspondence together with one point correspondence is already enough to determine the homography. Actually, it is enough to have only the x- or the y-coordinate of the point correspondence, while the second coordinate can already be used to verify the generated hypothesis. A more detailed derivation is given in the appendix.

If the hypothesis obtained from two affine correspondences is valid, it is verified using the remaining correspondences. The threshold in the RANSAC procedure should be adapted for individual features due to the warp of the features. Each feature is warped using a homography. The determinant of the Jacobian of this mapping indicates the amount of magnification of the image area around a feature. The RANSAC threshold can be adjusted accordingly for every feature. If the number of inliers exceeds a certain threshold and the hypothesized homography does not describe a degenerate case, the correspondences are assumed to be valid and they are used to initialize the camera parameters of the new sequence.
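A sketch of this magnification-adaptive threshold, reusing the homography Jacobian from Section 3.1; the base threshold is an assumed value:

```python
import numpy as np

def adapted_threshold(J, base_thresh=3.0):
    """Scale the RANSAC threshold by the local magnification of the warp:
    |det J| measures the area change, its square root the length change."""
    return base_thresh * np.sqrt(abs(np.linalg.det(J)))
```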

3.3. Calibration Refinement

The coarse initial calibration of all sequences added to the system so far can be refined in a bundle adjustment at any time. Since the correspondences obtained using MSER features are neither very accurate nor very numerous, a different strategy is applied to extract the correspondences for the bundle adjustment.

The coarse calibrations obtained from the MSER matches provide a very good idea of which views should be overlapping. For each pair of views the amount of overlap of image regions representing the playing field is determined. Views are connected based on the overlap. All images are warped such that the image plane of the synthetic view is parallel to the playing field. For all warped views, SIFT features are extracted and matched with the bag of features. Due to the already available coarse calibration, only those features lying in the field of view are considered. Finally, for every pair of connected views, correspondences are extracted from the SIFT features in the warped images. Since we are dealing with fronto-parallel views of the same orientation, a very simple 1-point RANSAC or even an exhaustive search can be applied (see the sketch below). Figure 4 shows an example of MSER correspondences obtained from a pair of images and SIFT correspondences obtained from fronto-parallel views.

Figure 4. (a) MSER feature matches from original images. (b) SIFT feature matches from fronto-parallel views.
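Between fronto-parallel views of the same orientation and scale, corresponding keypoints differ by a pure 2D translation, so a single tentative match already yields a full hypothesis. A sketch of such a 1-point RANSAC, with iteration count and threshold as illustrative assumptions:

```python
import numpy as np

def one_point_ransac(pts_a, pts_b, thresh=2.0, iters=200, seed=0):
    """pts_a, pts_b: Nx2 arrays of tentatively matched keypoint positions."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(pts_a), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(pts_a))
        t = pts_b[i] - pts_a[i]                  # translation hypothesis
        err = np.linalg.norm(pts_a + t - pts_b, axis=1)
        inliers = err < thresh
        if inliers.sum() > best.sum():
            best = inliers
    return best
```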

Once all connected views are processed, a graph having the features as nodes is constructed. Whenever two features match, an edge between them is introduced, i.e. feature matches are organized into tracks, as it was done by Snavely et al. [15]. In this graph, connected components with n nodes correspond to 3D points seen from n views. Together with the initial camera parameters, the DLT algorithm provides initial values for the points on the playing field [4]. As explained before, only points on the playing field are considered. Hence, correspondences that cannot be explained by a homography are removed in a RANSAC step. The initial parameters for the cameras and the 3D points are refined in a bundle adjustment. The 3D points are constrained to lie on the playing field. Once several sequences have been added and the bundle adjustment contains a large number of images, the resulting camera parameters and 3D points are not optimized any further, but considered as fixed parameters in future bundle adjustments. After the bundle adjustment all feature positions are updated. New MSER features and SIFT features with optimized 3D positions and camera parameters are added to the bag of features. Again, MSER features not lying on the playing field are ignored. Features which are already in the bag are updated. As in the initial step (Section 3.1), the elimination of duplicates and useless features is left to the bag of features. Additionally, at some point, the number of images and MSER features becomes too large. The next subsection explains the maintenance of the bag of features.
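The track-building step above amounts to computing connected components over the match graph. A minimal sketch using union-find (the helper names are illustrative):

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def build_tracks(num_features, matches):
    """matches: (i, j) feature-index pairs pooled over all connected views.
    Each returned track with n >= 2 members is a candidate field point seen
    from n views, to be initialized with the DLT algorithm."""
    uf = UnionFind(num_features)
    for i, j in matches:
        uf.union(i, j)
    tracks = {}
    for i in range(num_features):
        tracks.setdefault(uf.find(i), []).append(i)
    return [t for t in tracks.values() if len(t) >= 2]
```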

3.4. Bag of Features

As explained before, the bag of features contains MSER features and affine transformations, as well as SIFT features extracted from fronto-parallel views. This provides an abstraction of the scene (see Figure 2), which gives a lot of flexibility for future calibration tasks. New sequences can be added at any time, enhancing the bag of features.

In order not to fill the bag of features with many similar or useless features, we propose a simple updating scheme. While features not lying inside the playing field mask were already removed in the previous steps (see Section 3.2), there might still be duplicates and many features which cannot be used for matching. For that purpose we keep track of how often an MSER feature should have been matched and how often it actually was matched. Whenever a new sequence is added and calibrated successfully, it is known which features from the bag have been visible in one of the new views and, therefore, should have been used. Since features are covered (e.g. by players) quite often, the matching frequency cannot be very high. To keep the bag as discriminative as possible, features should only be eliminated if they are not unique in terms of scale and viewpoint. Furthermore, a feature should not be removed from the bag as long as it has not been matched at least 10 times although it should have been visible. Additionally, features acquired from very different viewpoints are not required to match at all, i.e. they are not considered as features that should have been used. After adding several new sequences to the bag of features, a pruning step is applied. All features with a matching ratio below 0.1 are removed.

The bag of features is divided into a regular grid. Each grid cell corresponds to a patch on the playing field of 15cm by 15cm. When eliminating features, each grid cell is considered independently of all the others (a sketch of this bookkeeping is given below). Figure 5 illustrates the effect of updating the bag of features. Features which have been added to the bag only recently and therefore cannot be removed yet are not shown.

Figure 5. (a) Red dots indicate features that are removed from the bag of features during the update step. Features which are considered to be useful and thus remain in the bag are indicated as green dots. (b) The fronto-parallel view shows that logos on the playing field coincide with the green dots.
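A sketch of this bookkeeping under one plausible reading of the 10-observation rule; the class layout is an assumption, and the uniqueness checks on scale and viewpoint are omitted:

```python
class BagFeature:
    def __init__(self):
        self.visible = 0   # times the feature should have been matched
        self.matched = 0   # times it actually was matched

def prune_cell(features, min_misses=10, min_ratio=0.1):
    """Pruning rule applied independently within one 15cm x 15cm grid cell."""
    kept = []
    for f in features:
        if f.visible - f.matched < min_misses:
            kept.append(f)     # too recently added to judge reliably
        elif f.visible > 0 and f.matched / f.visible >= min_ratio:
            kept.append(f)     # matched often enough to stay useful
    return kept
```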

It might happen that a novel sequence cannot be matched with the bag of features. In this case the calibration of this sequence is postponed until a few other sequences have been added to the system. Since for some sequences the calibration might never succeed and every attempt to calibrate such sequences is a waste of resources, such sequences are stalled for an increasing number of new sequences. The number of intermediate sequences is increased as a power of 2, i.e. such sequences are tried again after 1 new sequence, after 2 new sequences, after 4 new sequences and so on.
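The retry schedule is a simple exponential backoff; a minimal sketch:

```python
def retry_delay(num_failures):
    """Newly added sequences to wait before retrying a failed sequence:
    1 after the first failure, then 2, 4, 8, ..."""
    return 2 ** (num_failures - 1)
```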

In order to avoid drift in single camera calibration tracking, keyframes can be matched with the bag of features.

4. Evaluation

Experiments are conducted on basketball and ice hockey footage. The initial calibration of the starting frame is done by picking 3D to 2D correspondences by hand (the geometry of the playing field is standardized and known, e.g. the free-throw line in basketball). Figure 6(a) shows the initial calibration.

In order to evaluate the method, 25 random sequences from a game of ice hockey (SD, resolution of 720x576) and 20 random sequences from a game of basketball (HD, 1280x720) are chosen. Each sequence is 60 frames long. For each sequence, 4 out of the 60 frames are selected for calibration, which amounts to a total of 100 frames for the ice hockey footage and 80 frames for the basketball footage.

The robustness of the system is inspected by counting the number of successfully calibrated frames. A calibration is considered successful if the average reprojection error is below 5 pixels. Additionally, the results of the calibration are visualized by rendering the lines of the playing field model onto the original input images (Figures 6(b) and 6(c)).

For the ice hockey game, 52% of the sequences are calibrated successfully. While sequences showing large parts of the field are always calibrated successfully (see Figure 6(b)), cameras undergoing large zooms and fast movements are almost impossible to calibrate. The same holds for cameras filming at the height of the playing field, i.e. level with the players (see Figure 6(d)). Since the sequences are chosen completely at random without any subsequent selection and the footage at disposal often shows closeups of the goal area, the rate of successful calibration is not as high as one would expect. Nevertheless, the system manages to calibrate difficult sequences captured from new viewpoints separated by very wide baselines. The success rate of the camera shown in Figure 6(c) is at 67%, despite fast camera movements.

The cameras capturing the basketball game are calibrated at a success rate of 51%. Apart from the typical overview camera, the scene was captured by three more cameras. Again, many sequences show closeups of the players. The field of view is often crowded with players occluding most of the features all the time. In such cases, collecting features from different frames before matching does not help.

The implementation is not completely optimized for speed. Much time is spent on the extraction of features and the calculation of descriptors. Once extracted, descriptors are matched quite easily and efficiently on the GPU. While there exists a publicly available GPU implementation for SIFT feature extraction and matching [14], MSER features are handled by the CPU.

Figure 6. (a) Initial calibration from hand-clicked points, indicated as green squares. (b) Typical overview camera. (c) New view, separated by a wide baseline. (d) Typical failure case.

5. Conclusion and Future Work

In this paper we have presented an approach capable of dealing with the calibration of wide-baseline camera networks. We have shown its use for robust calibration and we have introduced a framework to seamlessly integrate novel sequences whenever necessary. The correspondences between images are obtained in several steps, coarse to fine, by collecting and matching features in a two-step method involving MSER features and SIFT features extracted from fronto-parallel views. Each sequence added is used to update a bag of features representing the playing field, increasing the robustness of the system.

To further improve the quality of our method, we plan to include field lines in the calibration procedure. This should increase the robustness in cases where field lines are visible. Additionally, some stadiums offer diverse banners, which could be added to the system to further increase its robustness. With some restrictions, the same holds for the audience in the stadium. Since both the banners and the audience approximately represent large scene planes, efficient matching should be possible.

Most of the cameras filming team sports such as basketball and ice hockey are mounted on some kind of device. Examples are pan-tilt-zoom cameras, cameras mounted on robot arms and cameras mounted on rails or wires. Modeling the resulting reduced space of possible calibrations could increase both robustness and accuracy.

6. Acknowledgments

The data is courtesy of Teleclub and LiberoVision. This project is supported by a grant of CTI Switzerland and the 4DVideo ERC Starting Grant Nr. 210806.

References

[1] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.
[2] F. Fraundorfer, C. Wu, J.-M. Frahm, and M. Pollefeys. Visual word based location recognition in 3D models using distance augmented weighting. In International Symposium on 3D Data Processing, Visualization and Transmission, 2008.
[3] S. Gedikli, J. Bandouch, N. von Hoyningen-Huene, B. Kirchlechner, and M. Beetz. An adaptive vision system for tracking soccer players from variable camera settings. In Proceedings of the 5th International Conference on Computer Vision Systems (ICVS), 2007.
[4] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
[5] K. Koeser, C. Beder, and R. Koch. Conjugate rotation: Parameterization and estimation from an affine feature correspondence. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[6] K. Koeser and R. Koch. Perspectively invariant normal features. In Proceedings IEEE International Conference on Computer Vision, 2007.
[7] LiberoVision. http://www.liberovision.com/.
[8] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[9] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In IJCAI '81: Proceedings of the 7th International Joint Conference on Artificial Intelligence, pages 674–679, 1981.
[10] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10):761–767, 2004.
[11] N. Owens, C. Harris, and C. Stennett. Hawk-eye tennis system. In VIE: International Conference on Visual Information Engineering, pages 182–185, 2003.
[12] Piero. http://www.redbeemedia.com/piero/.
[13] J. Shi and C. Tomasi. Good features to track. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, 1994.
[14] S. N. Sinha, J.-M. Frahm, M. Pollefeys, and Y. Genc. GPU-based video feature tracking and matching. In Workshop on Edge Computing Using New Commodity Architectures, 2006.
[15] N. Snavely, S. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3D. In Proceedings SIGGRAPH, 2006.
[16] G. Thomas. Real-time camera tracking using sports pitch markings. Journal of Real-Time Image Processing, 2(2-3):117–132, 2007.
[17] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment: A modern synthesis. In ICCV '99: Proceedings of the International Workshop on Vision Algorithms, 2000.
[18] T. Tuytelaars and K. Mikolajczyk. Local invariant feature detectors: A survey. Foundations and Trends in Computer Graphics and Vision, 3(3):177–280, 2008.
[19] C. Wu, B. Clipp, X. Li, J.-M. Frahm, and M. Pollefeys. 3D model matching with viewpoint-invariant patches (VIP). In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, 2008.

A. Homography Computation and Verification

A homography of the form shown in Equation 1 does not have the usual 8 degrees of freedom [4]. It is constrained by the special structure of $r_1$ and $r_2$, which are the first two columns of the camera rotation matrix:

$$r_1^T r_1 = 1, \quad r_2^T r_2 = 1 \quad \text{and} \quad r_1^T r_2 = 0. \quad (3)$$

Since $K$ is known, it can be eliminated by transforming image coordinates accordingly. What remains is a scaling of the first two entries of $r_1$ and $r_2$ by $f$, which leads to

$$H = \lambda \begin{pmatrix} f r_{11} & f r_{12} & f t_1 \\ f r_{21} & f r_{22} & f t_2 \\ r_{31} & r_{32} & t_3 \end{pmatrix}. \quad (4)$$

From $r_1^T r_2 = 0$ it follows that $h_{11}h_{12} + h_{21}h_{22} + f^2 h_{31}h_{32} = 0$. Solving for $f^2$ leads to

$$f^2 = -\frac{h_{11}h_{12} + h_{21}h_{22}}{h_{31}h_{32}}. \quad (5)$$
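As a numerical companion to Equations 2 and 5, the sketch below assumes $K$ has already been removed from $H$; `focal_from_homography` is a hypothetical helper, degenerate whenever $h_{31}h_{32} = 0$ (e.g. for fronto-parallel views, where the last row has no perspective terms).

```python
import numpy as np

def focal_from_homography(H):
    """Recover f via Equation 5, for homographies of the form of Equation 4."""
    denom = H[2, 0] * H[2, 1]
    if denom == 0.0:
        return None                      # degenerate: no perspective terms
    f2 = -(H[0, 0] * H[0, 1] + H[1, 0] * H[1, 1]) / denom
    return np.sqrt(f2) if f2 > 0 else None

def rotation_constraint(H):
    """Residual of Equation 2; zero (up to noise) for valid hypotheses."""
    return ((H[0, 0] * H[0, 1] + H[1, 0] * H[1, 1])
            * (H[2, 1] ** 2 - H[2, 0] ** 2)
            + H[2, 0] * H[2, 1]
            * (H[0, 0] ** 2 + H[1, 0] ** 2 - H[0, 1] ** 2 - H[1, 1] ** 2))
```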

Similarly, $r_1^T r_1 = r_2^T r_2$ means that $h_{11}^2 + h_{21}^2 + f^2 h_{31}^2 = h_{12}^2 + h_{22}^2 + f^2 h_{32}^2$. Inserting the value of $f^2$ from Equation 5 finally leads to the desired constraint given in Equation 2.

Treating the homography as a mapping $f_H$ from $\mathbb{R}^2$ to $\mathbb{R}^2$, image coordinates $x = (x, y)^T$ are transformed as

$$f_H(x) = \frac{1}{h_{31}x + h_{32}y + h_{33}} \begin{pmatrix} h_{11}x + h_{12}y + h_{13} \\ h_{21}x + h_{22}y + h_{23} \end{pmatrix}. \quad (6)$$

Since homographies are defined up to scale, $h_{33}$ can be set to 1. Given an affine correspondence and the associated affine transformation $f_A$ mapping $x_1$ to $x_2$, the homography $H$ should fulfil $f_H(x_1) = x_2$. By translating the source and the target coordinate systems such that the homography maps $(0, 0)^T$ to itself, $h_{13}$ and $h_{23}$ become 0. After setting $h_{13}$ and $h_{23}$ to 0, the Jacobian of $f_H$ evaluated at $x_0 = (0, 0)^T$ is

$$\left.\frac{\partial f_H}{\partial x}\right|_{x=0} = \begin{pmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{pmatrix}, \quad (7)$$

which leads to the following first order Taylor approximation:

$$f_H(x_0 + \delta x) \approx f_H(x_0) + \left.\frac{\partial f_H}{\partial x}\right|_{x=0} \delta x. \quad (8)$$

Setting this equal to the affine transformation $f_A(x) = Ax + t$ obtained from the affine correspondence leads to

$$\begin{pmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad (9)$$

since the coordinate systems are chosen such that the translation vector $t$ is 0. Therefore, the only unknowns left to determine are $h_{31}$ and $h_{32}$. In addition to Equation 2, one more constraint is needed. This constraint is provided by a second (point) correspondence, i.e. $f_H(y_1) = y_2$. Using one of the two resulting equations leads to three possible solutions for $H$, while the second equation can already be used to verify the correctness of $H$. What is especially nice about this verification is that the remaining equation which needs to be fulfilled is given in pixel units.
