White Paper: Camera Calibration and Stereo Vision

Peter Hillman
Square Eyes Software

15/8 Lochrin Terrace, Edinburgh EH3 9QL

[email protected]

www.peterhillman.org.uk

October 27, 2005

Abstract

This white paper outlines a process for camera calibration: computing the mapping between points in the real world and where they arrive in the image. This allows graphics to be rendered into an image in the correct position. Given this information for a pair of stereo cameras, it is possible to reverse the process to compute the 3D position of a feature given its position in each image — one of the most important tasks in machine vision. The system presented here requires the capture of a calibration chart with known geometry.

1 Introduction and Overview

This document is intended to be a tutorial which describes a process for calibrated stereo vision: it is meant as a simple get-it-working overview, with some of the important bits of the theory “glossed over” rather than explained in detail. The idea is that after reading this paper you should be able to implement a reliable and reasonably robust system, rather than understand all of the theory. For detail, refer to Hartley and Zisserman’s Multiple View Geometry [4].

In this paper the terms 3D position and 3D co-ordinates refer to the actual position of an object in the real world. A real-world object is also called a feature. When viewed through a camera, the feature appears at some position in the image. This is referred to as the feature’s Image Point, which has 2D co-ordinates measured in pixels.

The remainder of this paper is arranged as follows: virtually all the processing uses homogeneous co-ordinates, which are explained in Section 2. The mapping between 3D co-ordinates of features and 2D positions of their corresponding image points is given by a 3×4 Projection Matrix P. Section 3 explains this matrix and how it is used: how the matrix is used to find where a feature in space appears in an image. To reverse the process and find the 3D position of a feature, two different views of the point are required. This is the classic stereo vision problem and is presented in Section 4.

These sections assume that P is known. Section 5 shows how a calibration chart can be used to find this data. Although stereo vision is best understood by presenting the algorithms in this order, it is best to implement the paper backwards: the calibration algorithm in Section 5 first, then the 3D reconstruction algorithm in Section 4.


1.1 Assumed background

This paper assumes you know a little bit about linear algebra, but not much. If you’ve ever done matrix multiplication and tried to do matrix inversion, that’s probably enough. You should also have met some kind of image processing and have some idea of the multitude of different techniques that you can use to identify the position of a feature in an image.

2 Homogeneous co-ordinates

In normal co-ordinates, a point in an image (an image point) which is x pixels to the right of the origin and y pixels below (or above) it is described as a pair (x, y) or with a vector

$$\begin{bmatrix} x \\ y \end{bmatrix}$$

However, in most computational geometry systems, homogeneous co-ordinates are used. An extra element w is tagged onto the vector:

$$\begin{bmatrix} a \\ b \\ w \end{bmatrix} \text{ for 2D and } \begin{bmatrix} a \\ b \\ c \\ w \end{bmatrix} \text{ for 3D.}$$

To convert from homogeneous co-ordinates to normal co-ordinates, simply divide a, b (and c if 3D) by w to get x, y and z: x = a/w, y = b/w, z = c/w. Choosing w = 1 makes this process simpler: just chop off the last element.

In this paper, co-ordinates are sometimes written as transposed row vectors such as [a b w]^T, since it takes less space on the page. Similarly, homogeneous points in 3D are written [a b c w]^T.

If you choose it to be 1, why bother with w at all? Two reasons. Firstly, it means that matrix multiplication can be used much more effectively to manipulate points. With normal (inhomogeneous) co-ordinates, a 2 × 2 matrix can rotate or scale a point, but only about the origin: it cannot apply a translation. To achieve translation, a constant must be added. With homogeneous co-ordinates, this is possible. Consider this case:

$$\begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

Here, w on the right hand side has been set to 1 for convenience, which means a = x and b = y. Applying matrix multiplication gives x' = x + t_x and y' = y + t_y with w' = 1. Thus, a constant vector [t_x t_y]^T has been added to each point using matrix multiplication. (You should try to convince yourself that this works no matter what w is set to.)
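The conversion and the translation example above are easy to try out in code. Below is a minimal numpy sketch; the function names are my own, not part of any standard API:

```python
import numpy as np

def to_homogeneous(p):
    """Append w = 1 to an inhomogeneous point, e.g. (x, y) -> (x, y, 1)."""
    return np.append(np.asarray(p, dtype=float), 1.0)

def from_homogeneous(p):
    """Divide through by the last element w and drop it, e.g. (a, b, w) -> (a/w, b/w)."""
    p = np.asarray(p, dtype=float)
    return p[:-1] / p[-1]

# Translation by (tx, ty) as a 3x3 matrix acting on homogeneous 2D points
tx, ty = 5.0, -2.0
T = np.array([[1.0, 0.0, tx],
              [0.0, 1.0, ty],
              [0.0, 0.0, 1.0]])

p = to_homogeneous([10.0, 20.0])   # [10, 20, 1]
print(from_homogeneous(T @ p))     # [15., 18.]
```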

Another reason to use homogeneous co-ordinates is the ability to represent points infinitely far away: [1 0 0]^T is a point infinitely far away on the x-axis (the direction in which the x-axis points), and [0 1 0]^T is infinitely far away on the y-axis. Why would you want to represent infinite points? Because points which are infinitely far away in 3D space can appear at a fixed, finite position in an image, thanks to the projection matrix P described in the next section. The “vanishing point” — the point at which parallel railway lines appear to converge — is infinitely far away but shows up in an image. Stars can also be thought of as infinitely far away, but they will still be at a given position in an image.


3 The Projection Matrix

The projection matrix P is a mapping between the 3D co-ordinates of a feature and the feature’s image point: P is a mapping from points in 3D space (the world) to 2D space (the image). A graphics renderer applies a projection matrix to a feature in 3D space in order to find where to draw the feature in the image. A real camera effectively does the same. That a matrix multiplication is sufficient for the job will not be explained here, but such a matrix can be composed of the following components:

• the 3D position of the camera — a translation

• the pixel pitch (which is also related to the image size) — a scaling

• the angle of view of the camera (where it is looking) — a set of rotations

• the effective focal length of the camera. This causes the 2D x and y positions to be dependent on the 3D z position, so that points generally appear to move in the image as they approach the camera.

Since a vector describing the position of a 3D point has 4 elements including w, and a vector describing the position of a 2D point has 3 elements, P has 3 rows and 4 columns. To map a point from 3D to a 2D image point using P, we apply the following formula:

$$\begin{bmatrix} a' \\ b' \\ w' \end{bmatrix} = P \begin{bmatrix} a \\ b \\ c \\ w \end{bmatrix} \qquad (1)$$

or more verbosely

$$\begin{bmatrix} a' \\ b' \\ w' \end{bmatrix} = \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ w \end{bmatrix} \qquad (2)$$

Now you have enough information to make a 3D graphics rendering package! The following algorithm is at the heart of most packages:

Algorithm 1 To draw 3D points in an image at the correct location:

• decide on the value of P, based on where you want the camera to be, where you want it to be looking and what field of view you want.

• for each point p = (x, y, z) in the 3D scene:

– add an element w = 1 to form a homogeneous co-ordinate vector [x y z 1]^T

– compute [a b w]^T = P [x y z 1]^T

– divide by w to find the 2D image point: x = a/w and y = b/w

– draw a point at that position in the image

Of course this algorithm doesn’t do any of the fancy lighting effects or hidden surface removal you’ll need to make a realistic image. For that, read Foley et al. [2].
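As a rough illustration of Algorithm 1, here is a short numpy sketch that projects a handful of 3D points with a given P. The matrix used here is made up purely for the example:

```python
import numpy as np

def project_points(P, points_3d):
    """Project Nx3 world points to Nx2 image points using a 3x4 projection matrix P."""
    pts = np.asarray(points_3d, dtype=float)
    homog = np.hstack([pts, np.ones((pts.shape[0], 1))])   # [x y z 1] per row
    proj = (P @ homog.T).T                                  # rows of [a b w]
    return proj[:, :2] / proj[:, 2:3]                       # divide by w -> (x, y)

# Example with an arbitrary (made-up) projection matrix
P = np.array([[800.0, 0.0, 320.0, 0.0],
              [0.0, 800.0, 240.0, 0.0],
              [0.0,   0.0,   1.0, 0.0]])
print(project_points(P, [[0.1, 0.2, 2.0], [0.0, 0.0, 4.0]]))
```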


Figure 1: Reconstruction: The 2D point (shown as a cross) corresponds to a line of points (dashed line) in 3D; any point along this line would project to the same point. Thus, the 3D position of a point cannot be determined in general from one camera.

4 3D recovery

The previous section describes how to compute the 2D position of the point given the 3D co-ordinates. How can we go the other way? Can we work out where a point must be in real 3D space given its position in an image (its image point)? The simple answer is you cannot, except in very special circumstances. This is because all points on a 3D line (called the back projection line) end up at the same image point: given an image point, you know that the feature must lie on the back projection line but you cannot tell where on the line the feature is. So, P applied in reverse maps a 2D point to a 3D line: see Figure 1.

Figure 2: Reconstruction 2: The same point observed in two cameras results in two different lines which intersect at the 3D object. Hence, 3D reconstruction from stereo cameras is possible.

With two cameras, each camera gives a different 3D back projection line. These two back projection lines usually coincide at exactly one point (Fig. 2). So, given a stereo setup it is possible to find the 3D position of a point by observing its position in two different cameras. Why “usually”? It is possible that the point is infinitely far away (think back to the stars), and if the cameras are looking in the same direction separated only by a translation (like binoculars), then the back projection lines are parallel and will not (strictly speaking) intersect. However, if homogeneous points are used carefully, an “infinite point” of the form [a b c 0]^T will be recovered, which can be used to compute the direction of the point.


Recovery of the 3D position of a point proceeds as follows. For two cameras with two projection matrices P_1 and P_2 we have

$$\begin{bmatrix} a_1 \\ b_1 \\ w_1 \end{bmatrix} = P_1 \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (3)$$

$$\begin{bmatrix} a_2 \\ b_2 \\ w_2 \end{bmatrix} = P_2 \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (4)$$

We know the observed image points in each camera x_1, y_1, x_2 and y_2, given by a_1/w_1, b_1/w_1, a_2/w_2 and b_2/w_2 respectively, but we cannot assume that w = 1. We wish to solve for [X Y Z]^T, since this is the 3D position of the feature.

Now, multiply out in terms of the P_i, where i is 1 or 2. Here, p^i_{11} means element (1,1) of P_i:

$$a_i = X p^i_{11} + Y p^i_{12} + Z p^i_{13} + p^i_{14} \qquad (5)$$
$$b_i = X p^i_{21} + Y p^i_{22} + Z p^i_{23} + p^i_{24} \qquad (6)$$
$$w_i = X p^i_{31} + Y p^i_{32} + Z p^i_{33} + p^i_{34} \qquad (7)$$

Substituting into x_i w_i = a_i and y_i w_i = b_i gives

$$X x_i p^i_{31} + Y x_i p^i_{32} + Z x_i p^i_{33} + x_i p^i_{34} = X p^i_{11} + Y p^i_{12} + Z p^i_{13} + p^i_{14} \qquad (8)$$
$$X y_i p^i_{31} + Y y_i p^i_{32} + Z y_i p^i_{33} + y_i p^i_{34} = X p^i_{21} + Y p^i_{22} + Z p^i_{23} + p^i_{24} \qquad (9)$$

Since we want to solve this for X, Y, Z, those terms are collected on the left and other terms on the right:

$$X (x_i p^i_{31} - p^i_{11}) + Y (x_i p^i_{32} - p^i_{12}) + Z (x_i p^i_{33} - p^i_{13}) = p^i_{14} - x_i p^i_{34} \qquad (10)$$
$$X (y_i p^i_{31} - p^i_{21}) + Y (y_i p^i_{32} - p^i_{22}) + Z (y_i p^i_{33} - p^i_{23}) = p^i_{24} - y_i p^i_{34} \qquad (11)$$

We substitute for i = 1 and i = 2 to give us four equations and write them as rows of a matrix multiplication:

$$\begin{bmatrix}
x_1 p^1_{31} - p^1_{11} & x_1 p^1_{32} - p^1_{12} & x_1 p^1_{33} - p^1_{13} \\
y_1 p^1_{31} - p^1_{21} & y_1 p^1_{32} - p^1_{22} & y_1 p^1_{33} - p^1_{23} \\
x_2 p^2_{31} - p^2_{11} & x_2 p^2_{32} - p^2_{12} & x_2 p^2_{33} - p^2_{13} \\
y_2 p^2_{31} - p^2_{21} & y_2 p^2_{32} - p^2_{22} & y_2 p^2_{33} - p^2_{23}
\end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} =
\begin{bmatrix}
p^1_{14} - x_1 p^1_{34} \\
p^1_{24} - y_1 p^1_{34} \\
p^2_{14} - x_2 p^2_{34} \\
p^2_{24} - y_2 p^2_{34}
\end{bmatrix} \qquad (12)$$

$$A\mathbf{X} = B \qquad (13)$$

We need to solve this for X. We’d like to invert A directly but this isn’t allowed (it isn’t even square), so instead we play the little pseudo-inverse trick: we can left-multiply each side by A^T, then by (A^T A)^{-1}. The inversion here is usually OK because A^T A is always square. So we have

$$(A^T A)^{-1} A^T A \mathbf{X} = (A^T A)^{-1} A^T B \qquad (14)$$


You’ll see that the first part of the left hand side of this is a matrix multiplied by its inverse, which by definition is the identity matrix I. Since IM = M for all M, we can write

$$\mathbf{X} = (A^T A)^{-1} A^T B \qquad (15)$$

That, believe it or not, is the basic equation of stereo vision. Let’s reiterate this in the form of an algorithm:

Algorithm 2

• Find P_1 and P_2, the projection matrices for each camera. This can be done “off-line” as they only change when the camera geometry changes.

• To find the 3D position of a point:

• Locate the position of the point (x_1, y_1) and (x_2, y_2) in the image from each camera.

• Form matrices A and B from (x_1, y_1), (x_2, y_2), P_1 and P_2 using equation (12).

• Form and invert (A^T A). It is best to use Singular Value Decomposition to do the inversion as it gives you the closest fit.¹

• Solve equation (15) to give the 3D position.

There you are! Now you can find points in 3D from a stereo camera setup. At least, you can once you’ve read Section 5, which explains how to find P.
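Below is a minimal numpy sketch of this reconstruction (equations (12) to (15)). Rather than forming (A^T A)^{-1} explicitly it uses a least-squares solver, which performs the same calculation more safely; the function name is my own:

```python
import numpy as np

def triangulate(P1, P2, pt1, pt2):
    """Recover the 3D position [X, Y, Z] of a feature seen at pixel pt1 in
    camera 1 and pt2 in camera 2, by solving A [X Y Z]^T = B (equations 12-15)."""
    rows, rhs = [], []
    for P, (x, y) in ((P1, pt1), (P2, pt2)):
        rows.append([x * P[2, 0] - P[0, 0], x * P[2, 1] - P[0, 1], x * P[2, 2] - P[0, 2]])
        rows.append([y * P[2, 0] - P[1, 0], y * P[2, 1] - P[1, 1], y * P[2, 2] - P[1, 2]])
        rhs.append(P[0, 3] - x * P[2, 3])
        rhs.append(P[1, 3] - y * P[2, 3])
    A = np.array(rows)
    B = np.array(rhs)
    # Least-squares solution; equivalent to (A^T A)^-1 A^T B but numerically safer
    X, *_ = np.linalg.lstsq(A, B, rcond=None)
    return X
```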

4.1 Epipolar Geometry

Figure 3: Epipolar Geometry: All points on the back projection line corresponding to a point observed in the left image project to a line in the right image called an epipolar line.

Figure 3 shows the same set-up as Figure 2. Suppose an image point is observed in the left image from a 3D feature. The exact position of the point is unknown, but it will definitely lie somewhere on the back projection line (shown dashed). If we take this line and project it onto the right camera image, we get a line in the right camera image.²

¹ The OpenCV and VXL libraries contain routines for this, as do Matlab and Scilab. The Numerical Recipes implementation still appears to be broken despite several attempts to fix it.

² ‘Projecting a line’ means taking every point on the line, projecting it using P, then combining the projected points together to form a line.


This line in the image is called an epipolar line. It is an important concept to grasp: if a feature projects to a point (x, y) in one camera view, the corresponding image point in the other camera view must lie somewhere on an epipolar line in that camera’s image. An image point in camera 1 corresponds to an epipolar line in camera 2 and vice versa.

Computing the equation for the epipolar line requires combining P_1 and P_2 to form a Fundamental Matrix F. Details are given in Hartley and Zisserman’s Multiple View Geometry, which provides a Matlab code sample for computing F from P_1 and P_2. It is mentioned here because it can be much more efficient to use epipolar lines. The algorithm would look like this:

Algorithm 3

• Find P_1 and P_2, the projection matrices for each camera, and combine them to form F. This can be done “off-line” as they only change when the camera geometry changes.

• To find the 3D position of a feature:

• Locate the position of the feature p_1 = [x_1 y_1 1]^T as observed by camera 1.

• Compute the epipolar line using e = F p_1. e is of the form [a b w]^T, where points on the line satisfy ax + by + w = 0.

• Scan along e in the image from camera 2 to find the feature at p_2 = [x_2 y_2 1]^T.

• Use equation (15) to find the 3D position from p_1 and p_2.

This is almost twice as fast as the simpler algorithm, since scanning the whole image from each camera is slower than scanning the whole of one image followed by one line in the other. However, it does rely on the feature being accurately located in camera 1. If the feature is mis-located, the epipolar line is likely to be wrong, and finding the best match on this incorrect epipolar line might give a reconstructed point which is significantly inaccurate.
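As a small illustration, assuming F has already been computed (for example following Hartley and Zisserman), the epipolar line and the pixel distance of a candidate point from it might be evaluated like this; both function names are my own:

```python
import numpy as np

def epipolar_line(F, pt1):
    """Epipolar line in image 2 for pixel pt1 = (x1, y1) in image 1.
    Returns (a, b, w) with points on the line satisfying a*x + b*y + w = 0."""
    return F @ np.array([pt1[0], pt1[1], 1.0])

def distance_to_line(line, pt):
    """Perpendicular distance (in pixels) from pt = (x, y) to the line (a, b, w)."""
    a, b, w = line
    return abs(a * pt[0] + b * pt[1] + w) / np.hypot(a, b)
```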

4.2 Errors in 3D projection

Figure 4: Reconstruction errors: With erroneously located features (red crosses), Algorithm 2 will find the best-fit 3D reconstructed point, which re-projects into the images at different points (black crosses) to the located feature points. The sum of the distances between the red cross and the black cross in each image gives a measurement of the error.

We return to the situation of Algorithm 2, where the whole of both images is scanned. If the feature is located precisely in both images, then the image point in camera 2 will lie on the epipolar line corresponding to the image point in camera 1 and vice versa. If the feature is inaccurately located in both images, this is unlikely to be the case. Another effect is that equation (13) will not hold exactly. An error in pixels can be computed once X has been found.


Simply take the computed point X and project it using P_1 and P_2 to find two reconstructed image points. Taking the distance between these reconstructed points and the originally located feature points gives a measure of error. Figure 4 shows the setup.
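A minimal sketch of this error measure, assuming P_1, P_2 and the reconstructed point X are already available; the function name is my own:

```python
import numpy as np

def reprojection_error(P1, P2, X, pt1, pt2):
    """Sum of pixel distances between the located feature points (pt1, pt2) and
    the projections of the reconstructed 3D point X into each image (Figure 4)."""
    error = 0.0
    for P, pt in ((P1, pt1), (P2, pt2)):
        a, b, w = P @ np.append(X, 1.0)          # project [X Y Z 1]
        error += np.hypot(a / w - pt[0], b / w - pt[1])
    return error
```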

5 Calibration: Finding the Projection Matrix

5.1 Choosing a co-ordinate frame for 3D points

2D points are measured in pixels, with the origin in some given position, usually the top left corner. But what about the 3D points? How should they be measured? Basically, any way you want. You must decide where you want the origin of the 3D space to be, what directions you want the axes to point in (make sure they are orthogonal though!), and what scale you want. It will depend on what your application is. If, for example, you are using the system to control a robotic arm, you will probably want to align the origin and axes with the system used to control the robotic arm. For a forward-looking vehicle-mounted camera, you could do worse than setting the origin to be the camera position, the scale to be in metres, the z axis to be along the road and y to be up. That way, the magnitude of a 3D co-ordinate tells you how far away the feature is.

5.2 Solving for the Projection Matrix

The 3D position of points cannot be recovered until P is found for each camera. If we have a set of features (points in 3D) and know the corresponding image points, then we can solve for P. Let’s suppose we have a set N of points, with C_n = (X, Y, Z) being the known 3D point and c_n = (x, y) being the known image point corresponding to C_n. Let’s restate equations (8) and (9), discarding the i superscripts and subscripts, since we are solving for each matrix separately:

$$X x p_{31} + Y x p_{32} + Z x p_{33} + x p_{34} = X p_{11} + Y p_{12} + Z p_{13} + p_{14} \qquad (16)$$
$$X y p_{31} + Y y p_{32} + Z y p_{33} + y p_{34} = X p_{21} + Y p_{22} + Z p_{23} + p_{24} \qquad (17)$$

Moving everything to the left hand side and writing in matrix form gives

$$\begin{bmatrix}
X & Y & Z & 1 & 0 & 0 & 0 & 0 & -Xx & -Yx & -Zx & -x \\
0 & 0 & 0 & 0 & X & Y & Z & 1 & -Xy & -Yy & -Zy & -y
\end{bmatrix}
\begin{bmatrix} p_{11} \\ p_{12} \\ p_{13} \\ p_{14} \\ p_{21} \\ p_{22} \\ p_{23} \\ p_{24} \\ p_{31} \\ p_{32} \\ p_{33} \\ p_{34} \end{bmatrix} = 0 \qquad (18)$$

$$C\mathbf{p} = 0 \qquad (19)$$


Here, the 12-element column vector p is just the elements of the 3×4 matrix P stacked into a vector so we can express equations (16) and (17) as a matrix multiplication. Once we’ve solved for p, all we need to do is reshape the elements to form P. We have two equations with 12 unknowns, which isn’t a promising start, but so far we’ve only used one point. However, we can add more points simply by adding rows to the matrix C using more correspondences C_n and c_n. Let’s add subscripts to identify different points:

$$\begin{bmatrix}
X_1 & Y_1 & Z_1 & 1 & 0 & 0 & 0 & 0 & -X_1 x_1 & -Y_1 x_1 & -Z_1 x_1 & -x_1 \\
0 & 0 & 0 & 0 & X_1 & Y_1 & Z_1 & 1 & -X_1 y_1 & -Y_1 y_1 & -Z_1 y_1 & -y_1 \\
X_2 & Y_2 & Z_2 & 1 & 0 & 0 & 0 & 0 & -X_2 x_2 & -Y_2 x_2 & -Z_2 x_2 & -x_2 \\
0 & 0 & 0 & 0 & X_2 & Y_2 & Z_2 & 1 & -X_2 y_2 & -Y_2 y_2 & -Z_2 y_2 & -y_2 \\
\vdots & & & & & & & & & & & \vdots \\
X_n & Y_n & Z_n & 1 & 0 & 0 & 0 & 0 & -X_n x_n & -Y_n x_n & -Z_n x_n & -x_n \\
0 & 0 & 0 & 0 & X_n & Y_n & Z_n & 1 & -X_n y_n & -Y_n y_n & -Z_n y_n & -y_n
\end{bmatrix} \mathbf{p} = 0 \qquad (20)$$

$$C\mathbf{p} = 0 \qquad (21)$$

If we have n points, C will have 2n rows. 0 is a column vector with 2n elements, all of which are zero.

So how do we solve for p? Since the right hand side of equation (21) is zero, we need to find the nullspace of C. We can solve this using Singular Value Decomposition [5]. This finds a decomposition of a matrix (in this case C) into three matrices U W V^T, where W is a diagonal matrix (only the elements on the diagonal of W are non-zero). The nullspace is those columns of V for which the corresponding element of W is (nearly) zero. So, if element W_aa is zero, or very close to zero, then column a of V is part of the nullspace.

To solve for p, use a standard implementation of Singular Value Decomposition (for example from VXL or Matlab), check that the elements of W are in descending order (they will be in the case of VXL and Matlab), and reshape the rightmost column of V into P. Any problems with this “ignorant” approach will be mopped up by the RANSAC algorithm.
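A minimal numpy sketch of this step, building C from the correspondences and taking the singular vector belonging to the smallest singular value; the function name is my own:

```python
import numpy as np

def solve_projection_matrix(world_pts, image_pts):
    """Solve C p = 0 (equation 21) for the 3x4 projection matrix P, given
    correspondences between 3D points (X, Y, Z) and image points (x, y)."""
    rows = []
    for (X, Y, Z), (x, y) in zip(world_pts, image_pts):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -X * x, -Y * x, -Z * x, -x])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -X * y, -Y * y, -Z * y, -y])
    C = np.array(rows, dtype=float)
    # numpy's SVD returns singular values in descending order, so the
    # (approximate) nullspace is the last row of Vt
    _, _, Vt = np.linalg.svd(C)
    return Vt[-1].reshape(3, 4)
```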

5.3 The RANSAC algorithm

We have a method for finding P using some number of points. But how many points should be used? Well, it makes sense to pick at least 6, since there are 12 elements in P and we get two equations per point, but this is not strictly necessary. Presumably, you will have a system that identifies points automatically in the image and somehow knows what the real position of the point is in space. One way of getting this data is using the calibration chart reading algorithm described in Section 6. There will probably be some error in the location of the points, which will introduce errors into C. The effect of SVD is to give a kind of least squares fit to the points provided. So, in some sense, the more points you use the better, since this will give a better fit. But you might have an outlier: you might mistake some speck of dust for one of the calibration chart points, which will give you at least one erroneous point. Including this erroneous point in C will skew the results and introduce errors in P. The answer is to use only the accurate points.

The outliers are difficult to detect, but the RANSAC (Random Sample Consensus) [1] algorithm provides a brute force solution. RANSAC repetitively chooses a random subset of the points (hopefully one without outliers), solves for P using just those, then looks at how good the solution is. Given perfect input data, P will map all the 3D points to the located 2D image positions. Of course, a few won’t match because there are outliers, and the rest won’t be in exactly the right position because of minor location errors. So, we say that the best solution for P is the one that reconstructs the most 3D points to a position close to the located 2D positions.

Written algorithmically:

Algorithm 4


• Given a list N of n point correspondences between 3D co-ordinates and the corresponding 2D image points, do the following for several iterations:

– pick x point pairs out of N to form C

– solve for P using SVD

– for each 3D to 2D point correspondence C ⇒ c in N:

∗ project C to the image point c′ using c′ = P C

∗ if c′ is fairly close to c, count it as a good match

• Set the final P to be the one that has the most good matches.

Here, a “good match” means the Euclidean distance between c and c′ is less than a parameter. This parameter should be proportional to the expected error in inlier matches. That is, look at the output from your point correspondence finding algorithm, throw away the obvious outliers, and measure the errors in the remaining correspondences.

So how many points x should you pick each time? The higher the error in the inliers, the more points you need, as the more you need the best fit to smooth out your errors. The maximum value you should use is n − r, where n is the number of correspondences and r is the maximum number of expected outliers. If you set it higher than this, then you will always have an outlier in the matrix C and you will be doomed. The more points you use, the more iterations of RANSAC you should run, since it will take longer to stumble upon a group of matches that don’t contain outliers.
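A sketch of Algorithm 4, assuming a solver such as the SVD sketch above is passed in as solve_fn; the function name and parameter defaults are illustrative only:

```python
import numpy as np

def ransac_projection(world_pts, image_pts, solve_fn, sample_size=8,
                      threshold_px=2.0, iterations=500, rng=None):
    """Repeatedly fit P to a random subset of the correspondences and keep the
    P with the most 'good matches'.  solve_fn(world_subset, image_subset)
    should return a 3x4 P.  threshold_px is the expected inlier error."""
    rng = rng or np.random.default_rng()
    world_pts = np.asarray(world_pts, float)
    image_pts = np.asarray(image_pts, float)
    n = len(world_pts)
    best_P, best_inliers = None, -1
    for _ in range(iterations):
        idx = rng.choice(n, size=sample_size, replace=False)
        P = solve_fn(world_pts[idx], image_pts[idx])
        # Project every 3D point and count how many land close to their image point
        homog = np.hstack([world_pts, np.ones((n, 1))])
        proj = (P @ homog.T).T
        proj = proj[:, :2] / proj[:, 2:3]
        inliers = np.sum(np.linalg.norm(proj - image_pts, axis=1) < threshold_px)
        if inliers > best_inliers:
            best_P, best_inliers = P, inliers
    return best_P
```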

5.4 Normalisation

Before you reach for your keyboard and implement all this junk, there is an important point to note. Suppose you want to measure the real-world co-ordinates in metres, and have a fancy high resolution camera. Columns 1-8 of C would then be of the order of 0.1 to 1, columns 9-11 of the order 1-100, and column 12 of the order 100-1000. This is bad, because the least squares action of SVD will be skewed by these large numbers in the last column. Hartley and Zisserman suggest (and rather emphatically insist upon) a normalisation process to make sure that each column is of the same order: before computing the matrix C, find the mean of all 3D points and remove this mean. Now scale these points so that the variance is 1. Do exactly the same to the 2D points c. The projection matrix P′ found from these scaled points needs to be pre- and post-multiplied so that it can work with un-normalised values.

$$P = \begin{bmatrix} \sigma_x & 0 & \bar{x} \\ 0 & \sigma_y & \bar{y} \\ 0 & 0 & 1 \end{bmatrix} P' \begin{bmatrix} \sigma_X & 0 & 0 & \bar{X} \\ 0 & \sigma_Y & 0 & \bar{Y} \\ 0 & 0 & \sigma_Z & \bar{Z} \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (22)$$

where x̄ and ȳ are the mean x and y positions of the image points and σ_x and σ_y are the variances of the image points. X̄, Ȳ and Z̄ are the mean x, y and z positions, and σ_X, σ_Y and σ_Z the variances, of the features.

6 Reading a calibration chart

A calibration object is some object with identifiable features, where the positions of the features are known. The idea is that you know where the features are in 3D space, and you can identify the corresponding image points, so you have the pairs of C_n and c_n required to compute the projection matrix P. It is common to use an array of squares on a flat piece of paper. This is actually a bad idea, since you only solve for points on two out of the three axes. It is better to have identifiable features on two pieces of paper 90 degrees apart. This is harder to make because the angle needs to be correct, but the improved accuracy pays dividends.


Figure 5: Calibration chart of 4cm squares on two sheets of paper 90 degrees apart. The y axis is the normal to the bottom sheet, the z axis the normal to the top sheet. The x axis is along the fold line. The origin is on the fold, half way between the two centre squares.

Figure 5 shows an example calibration chart. Note that it has one major problem: you can’t tell which way up the camera is. If you were to turn the camera upside-down, the projection would be inverted. Putting unique identifiers in the chart (like a bright red square in one corner) will go a long way to solving this problem.

Now we can start searching and building correspondences. Writing this robustly is difficult, since the pixel spacing of the squares is unknown and there can be any amount of rotation. A feature detector will do a reasonably good job of finding the corners of the squares, but it will also pick up the corners of the chart and anything else “corner-like” in the image. Note that you really don’t need to understand the following algorithms: feel free to skip to the next section. There are many ways of reading a calibration chart and this one – my own invention – has advantages and disadvantages. If you need to read a calibration chart, this algorithm might give you some ideas.

The first step in reading a calibration chart is to locate where in the image the corners of each square of the calibration chart are. Then, the squares are identified using their layout in the image.

This approach uses quadrants: whatever the rotation, one corner will lie in each quadrant of the square: one in the top left, one in the top right, one in the bottom left and one in the bottom right, with respect to the centre of the region. Figure 6 shows the process.

Algorithm 5

To locate the positions of the corners of each square in the image:

• Threshold the image at the average intensity of the centre portion of the image (assuming this is where the chart is).

• Region grow (also called connected components analysis) on all black regions in the thresholded image, to find the centre and size of each black region. Filter out any region which is far too small or big to be a black square. The output is a region image: each pixel in this image indicates the region number that the pixel belongs to in the original image. A region number of zero indicates that the point is white or belongs to a region that is too big or small.

Figure 6: Example of corner finding: The grey area is the square found by the region finder. The region is divided into quadrants which meet at the centre of the region. The filled circles are Harris features, marked as corners of this square. The hollow circle is a Harris feature rejected as a corner of the square since it is not the furthest corner in its quadrant from the region centre.

• Run a Harris feature detector [3] and select the 144 strongest points (there are 96 corners, so there are 50% more points in this list than expected corners). Call this list of points H.

• Dilate the region image by one pixel (to make sure that the located corner points lie within a region).

• For each corner h in H (with position (i, j)):

– Identify to which region the corner point belongs by reading the pixel (i, j) from the dilated region image. Let R be the region to which it belongs and (x, y) be the centre point of R.

– Identify to which quadrant the point belongs based on whether i is less than or greater than x and j is less than or greater than y.

– Mark this point as being in the appropriate quadrant of square R.

• If any quadrant of any region has more than one point, set the corner to be the furthest point from the centre of the region, using the city block distance d(a, b) = |a_y − b_y| + |a_x − b_x|.
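For illustration, here is a rough OpenCV/numpy sketch of these steps. It simplifies a few details (for example it reads the label image directly rather than dilating it first), and the thresholds and function name are my own:

```python
import cv2
import numpy as np

def find_square_corners(gray, min_area=100, max_area=20000, n_features=144):
    """Threshold, find black regions, detect Harris corners, and assign each
    corner to a quadrant ('tl', 'tr', 'bl', 'br') of its region.
    Returns {region_label: {quadrant: (x, y)}}."""
    h, w = gray.shape
    # Threshold at the mean intensity of the centre portion of the image
    centre = gray[h // 4: 3 * h // 4, w // 4: 3 * w // 4]
    _, binary = cv2.threshold(gray, float(centre.mean()), 255, cv2.THRESH_BINARY_INV)

    # Connected components on the black squares; filter regions by size
    n_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    valid = {i for i in range(1, n_labels)
             if min_area <= stats[i, cv2.CC_STAT_AREA] <= max_area}

    # Corner detection (the paper uses a Harris detector)
    corners = cv2.goodFeaturesToTrack(gray, n_features, 0.01, 10,
                                      useHarrisDetector=True)
    result = {}
    for (x, y) in corners.reshape(-1, 2):
        label = labels[int(round(y)), int(round(x))]   # paper dilates labels first
        if label not in valid:
            continue
        cx, cy = centroids[label]
        quad = ('t' if y < cy else 'b') + ('l' if x < cx else 'r')
        best = result.setdefault(label, {})
        # Keep the corner furthest from the region centre (city-block distance)
        if quad not in best or (abs(x - cx) + abs(y - cy) >
                                abs(best[quad][0] - cx) + abs(best[quad][1] - cy)):
            best[quad] = (float(x), float(y))
    return result
```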

At this point, we have (hopefully) identified the exact position of the corner of each square in the image, and using the quadrant system we have identified which corner is which. As the Harris feature detector might fail to locate the corners of some of the squares, some of the quadrants might be empty. Now comes the really tricky bit: working out which square is which. Because some quadrants might be empty, some points might have to be extrapolated (predicted).

Algorithm 6

• Find the top-left-most region in the image with all corner points marked in each quadrant. Assume (to start with at least) that this is the top left square of the calibration chart.

• output the correspondence between the 3D position of the top left square and the image position of the centre of the square.

• predict the location of the next square (see Fig. 7). If a, b, c, d are the corners, the corners of the next square will lie close to 2b − a, 3b − 2a, 2d − c, 3d − 2c.

• Find the square s closest to the predicted location.


Figure 7: Identifying which square is which.

• If any corners are missing from s, use the predicted location for the corner, otherwise use the actual location.

• output the correspondence between the 3D position of s and the image position of the centre of the square.

• continue until the end of the row, predicting the position of each next square from the previous one. Then do the next column, predicting the first square of the next row from the top left square.
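The prediction step is a one-liner; a tiny sketch, with the corner labels as in Figure 7 and the function name my own:

```python
import numpy as np

def predict_next_square(a, b, c, d):
    """Given the four located corners a, b, c, d of one square, predict where
    the corners of the next square along the row should appear."""
    a, b, c, d = (np.asarray(p, dtype=float) for p in (a, b, c, d))
    return 2 * b - a, 3 * b - 2 * a, 2 * d - c, 3 * d - 2 * c
```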

The centre points are used for correspondences since they are more predictably located than the corners.

If too many points needed to be predicted rather than read from the actual location, the chances are that the region assumed to be the top left of the chart was wrong. So, pick the next top-left-most region and see if you need to predict fewer points.

Note that the first row of black squares on the bottom sheet starts in the fold. This makes it easier to predict the position of these squares.

7 Recommended exercise: Your first test

The rather convoluted algorithms of Section 6 give reasonably accurate results if you can get them to locate most of the regions correctly. To start with, set up a couple of cameras on tripods, make a calibration chart like the one in Fig. 5, and photograph it. Then locate (by hand if you don’t fancy implementing Section 6) the corners of the squares, and output them and the corresponding 3D positions of the squares. Taking the origin as half way along the centre fold means the top left corner of the top left square will be (-14, 24, 0) and the bottom right corner will be at (14, 0, -20), with the z axis being depth away from the camera and the scale in cms.

Feed all those correspondences into the RANSAC algorithm to find P for both cameras. You can confirm that it has worked by drawing the reconstructed positions of all the points on top of the image.

Now you’ve got P_1 and P_2, put some object in the scene, measuring its 3D position relative to the calibration chart origin. Photograph again, and feed the pixel positions into the stereo algorithm. If you get roughly the right 3D position out, you know you are on the right track!


Shameless plug

I (the author of the paper) am a freelance researcher and developer, working on problems like this, and specialising in the development of plug-ins for visual media post-production/digital special effects packages like Shake and Maya. I’d be more than happy to advise with any questions you have on any aspect of stereo vision or any other image processing problem, and I’d be much more than happy to take some money off you to develop specialised imaging software for any application.

Refer to my website, http://www.peterhillman.org.uk/, for more information and more shameless plugs.

References

[1] H. Cantzler. Random Sample Consensus (RANSAC). http://www.inf.ed.ac.uk/.

[2] James Foley, Andries van Dam, Steven Feiner, John Hughes, and Richard Phillips. An Introduction to Computer Graphics. Addison-Wesley, 1993.

[3] Chris Harris and Mike Stephens. A combined corner and edge detector. In Alvey Vision Conference, 1988.

[4] Richard Hartley and Andrew Zisserman. Multiple View Geometry. Cambridge University Press, second edition, 2003.

[5] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, chapter 2.6: Singular Value Decomposition, pages 59–70. Cambridge University Press, second edition, 1992.
