
CENTER FOR MACHINE PERCEPTION
CZECH TECHNICAL UNIVERSITY

RESEARCH REPORT
ISSN 1213-2365

Stereoscopic Matching: Problems and Solutions

PhD Thesis Proposal

Jana Kostková

[email protected]

CTU–CMP–2002–13

September 4, 2002

Available at ftp://cmp.felk.cvut.cz/pub/cmp/articles/kostkova/Kostkova-TR-2002-13.pdf

Supervisor: Dr. Radim Šára

This research was supported by the Grant Agency of the Czech Republic under project GACR 102/01/1371, by the Czech Ministry of Education under project MSM 212300013, and by the Grant Agency of the Czech Technical University under project CTU 0209113.

Research Reports of CMP, Czech Technical University in Prague, No. 13, 2002

Published by

Center for Machine Perception, Department of Cybernetics
Faculty of Electrical Engineering, Czech Technical University

Technická 2, 166 27 Prague 6, Czech Republic
fax +420 2 2435 7385, phone +420 2 2435 7637, www: http://cmp.felk.cvut.cz


Contents

1 Introduction
  1.1 Binocular Stereo Matching

2 State-of-the-Art
  2.1 Stereo Matching
    2.1.1 Matching Elements
    2.1.2 Matching Algorithms
  2.2 Stereo Algorithms Evaluation

3 Proposed Methods
  3.1 Disparity Component Matching
    3.1.1 Disparity Component Algorithm
  3.2 Stereo Algorithms Evaluation
    3.2.1 Types of Error
    3.2.2 Experimental Setup
    3.2.3 Evaluation Method

4 Experiments
  4.1 Evaluated Algorithms
  4.2 Evaluation Based on our Methodology
  4.3 Evaluation Based on Scharstein & Szeliski's Test Dataset
  4.4 Comparison of Evaluation Methodologies

5 Thesis Plan


Abstract

This thesis proposal addresses stereoscopic matching. An introduction to the correspondence problem is given first. Then, the state of the art in binocular stereo matching is presented. Our previous research on this problem is described, together with its classification among recent stereo matching approaches. Finally, the thesis goals are formulated.

1 Introduction

A traditional problem in computer vision is the reconstruction of a 3D model of a scene from a set of 2D images capturing the scene from different points of view. Given one 2D image and the centre of projection of its camera, we can construct the light ray passing through the camera centre and a selected pixel in the image. The spatial point represented by that pixel lies somewhere on this ray. At least two images are therefore required to reconstruct the 3D point: the light rays intersect in the desired spatial position—cf. Fig. 1.


Figure 1: Scene geometry: C1, C2 are the centres of the left and right camera, respectively. The projections of a spatial point P into the left and right images are denoted p1 and p2, respectively.

Given two pixels, one in each image, that are projections of the same spatial point, this 3D point can easily be reconstructed by intersecting the camera light rays. Such a pair of pixels forms a corresponding pair, which is the basic element of the scene reconstruction process. However, establishing correspondences between two (or more) images is a difficult task.

The field of stereo matching focuses precisely on the task of finding these correspondences. It usually assumes that the camera geometry is known (or calibrated [27]), so that the epipolar constraint can be applied. This constraint guarantees that the point corresponding to a point in one image lies on the corresponding epipolar line in the other image, and vice versa. The situation is shown in Fig. 2. The epipolar line is the intersection of (1) the plane passing through the point whose correspondence is sought and the baseline—the line connecting the cameras' centres of projection—and (2) the projection plane of the second camera, i.e. the second image.



Figure 2: Epipolar geometry: corresponding points lie on the respective epipolar lines (e1, e2) in the other image. The epipolar line is the intersection of (1) a plane given by the point in question and the two projection centres, and (2) the projection plane of the second camera—the second image.

Using the epipolar constraint, the images can be rectified, i.e. transformed so that the epipolar lines coincide with the corresponding image rows. This transformation reduces the correspondence search from a 2D task to a 1D task, in which only corresponding image rows need to be examined.

Once the correspondences are established, the 3D scene model, or more precisely its depth, can be recovered. The depth is (approximately) inversely proportional to the disparity, which is the shift (in pixels) of the corresponding points between the two images. Consequently, assuming rectification, the disparity between a matched pair of points from the left image IL and the right image IR, respectively—IL(u, r) and IR(v, r)—is defined as follows:

d(u, r) = u − v.   (1)

In order to reconstruct the 3D coordinates of a spatial point from the disparity map, the camera geometry (in particular the focal length and the baseline) has to be known. Assuming this, together with rectification (i.e. parallel optical axes, the same focal length, and no rotation of the projective planes), the 3D coordinates of a point P corresponding to the matched pair IL(u, r) and IR(v, r) can be expressed as follows:

x = b·u / d(u, r),    y = b·v / d(u, r),    z = b·f / d(u, r)    (2)

where P = [x, y, z], b is the baseline between the cameras' centres of projection, and f is the common focal length of the cameras.
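For concreteness, here is a minimal Python sketch of Eqs. (1) and (2); the baseline b, focal length f, and the example pixel coordinates are assumed values chosen only for illustration.

```python
import numpy as np

def reconstruct_point(u, v, b, f):
    """Triangulate a 3D point from a rectified matched pair IL(u, r), IR(v, r).

    Follows Eqs. (1)-(2): d = u - v, then x = b*u/d, y = b*v/d, z = b*f/d.
    (The common row r does not enter Eq. (2) and is therefore omitted here.)
    """
    d = u - v                       # Eq. (1): disparity of the matched pair
    if d == 0:
        raise ValueError("zero disparity: the point is at infinity")
    return np.array([b * u / d, b * v / d, b * f / d])

# example: baseline b = 0.1 (scene units), focal length f = 700 (pixels),
# matched pair IL(420, r) <-> IR(400, r), i.e. disparity 20
print(reconstruct_point(u=420, v=400, b=0.1, f=700))   # larger disparity => smaller depth z
```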

1.1 Binocular Stereo Matching

The core of the stereo matching problem is to establish correspondences—matching pairs—between the input images. Assuming rectification, the matching pairs are selected, based on a defined criterion, from all tentative pairs formed among corresponding image rows. However, the stereo matching problem is an ill-posed task, because its solution is in general ambiguous (except in some special situations). In our opinion, even if the scene were continuous and everywhere visible (with no half-occlusions), the matching would remain ambiguous for two reasons:

1. Data uncertainty—due to the low signal-to-noise ratio in poorly textured image regions, and

2. Structural ambiguity—due to the periodic structures in the image.

The data uncertainty is never totally avoidable; it can only be partly reduced by using better, more discriminable matching cost measures. On the other hand, the structural ambiguity can be significantly reduced by using more cameras observing the scene from various points of view, because the polynocular case disambiguates the correspondence problem [48, 13, 52, 35].

In our research (and consequently in this survey as well), we focus on binocular stereo matching, i.e. the matching problem is solved using only two input images. Computing correspondences correctly from only two images is a highly challenging problem. Reducing the number of cameras to its feasible minimum allows us to study the complete correspondence problem in detail and to discover its limitations.

In binocular stereo vision, due to the different viewing angles of the left and right cameras, (slightly) different parts of the real scene are captured in the input images. According to visibility, the scene can be divided into two parts:

• binocularly-visible areas: consisting of points visible in both the left and the right camera,

• monocularly-visible areas: consisting of points visible only in one camera.

The areas of binocularly-visible points usually do not cause any special problems, except for regions of low texture, repetitive patterns, or specularities, and correspondences can be established there. However, the areas of monocularly-visible points, the so-called half-occluded regions, are present in only one image and consequently have no correspondences. Therefore, no correspondences should be determined at points of these areas.

Scene configurations with half-occluded regions are shown in Fig. 3, with the occlusions depicted in red. The origin of half-occlusions lies in depth discontinuities relative to the cameras. In general images, two kinds of discontinuities may exist: scene discontinuities (demonstrated in Figs. 3(a), 3(b)) and surface discontinuities (demonstrated in Fig. 3(c)). Scene discontinuities are created at object boundaries (e.g. due to thin objects, or holes in the foreground), while surface discontinuities are caused by surface creases.

Explicit detection and proper recognition of occlusions is important for the global reconstruction process, because spurious correspondences assigned in these regions may lead to completely erroneous results. Therefore, precise detection of half-occluded regions is highly desirable.

Figure 3: Scene configurations with half-occluded regions (highlighted in red): (a) occlusion due to a thin object in the foreground—scene discontinuity, (b) occlusion due to a small hole in the foreground—scene discontinuity, (c) occlusion due to surface variation—surface discontinuity.

The aim of all stereo matching approaches is to design an algorithm which establishes the correspondences correctly. From our point of view, "to establish correspondences correctly" means (1) to compute accurate (error-free) disparity in pixels which are to be matched—binocularly visible pixels, and (2) to detect half-occlusions correctly. We prefer a sparse disparity map with no errors to a dense disparity map with various kinds of errors. While erroneous data (even if dense) may lead to spurious reconstruction, sparse but correct data do not exhibit this artifact. Nevertheless, we want our resulting disparity maps to be as dense as possible.

2 State-of-the-Art

Stereo vision has traditionally been one of the most investigated topics in computer vision, which has resulted in a very large number of publications dedicated to it. Due to this enormous quantity of articles, proposing any taxonomy of algorithms is a very hard task.

In this section, we first try to create such a taxonomy of algorithms. We introduce a classification of the most important stereo matching approaches together with their description. Then, we summarize the existing attempts at stereo algorithm evaluation.

2.1 Stereo Matching

The stereo matching problem has been addressed for more than four decades [31]. However, the beginning of a rigorous stereo vision theory can be placed in the work of Marr & Poggio [43]. Marr identified the natural ambiguity of the correspondence problem (described in the previous section) and defined two main constraints allowing a precise formulation of the matching task:

1. Uniqueness: a point in one image can be matched to at most one point in the other image (resulting from the assumption that one spatial point is represented in an image by at most one point),


2. Continuity: the disparity varies smoothly almost everywhere (resulting from the observation that matter is cohesive and separated into objects whose surfaces are generally smooth).

Marr & Poggio have proposed a framework incorporating the procedure of humanstereopsis, which has been modified for computational stereo problem. Since they thoughtthe pixel intensity can vary even in corresponding pixels, but the sharp changes in intensitycan be detected precisely, they proposed an approach based on matching features—locatedat positions of sharp intensity changes: edges, corners, etc. All the features are describedby its signatures (representing the features characteristics) and the correspondences areestablished based on the selection of the best matching signatures. Typically, only thefeatures are matched producing sparse disparity maps. The Marr’s cohesivity principle(based on the surface continuity assumption) have been improved in the PMF algorithmdesigned by Pollard, Mayhew and Frisby [50], where the cohesiveness is represented by thecondition of disparity gradient limit.

In the early 1990s, researchers focused on a modification of Marr's theory in which disparity is estimated directly in all pixels. At this point, research in stereo matching split into two main directions: the first direction—global methods—reformulated the task so that the disparity function and all the assumptions about the structure of the world are modelled by Bayesian variables, and the disparity is estimated in a Bayesian framework. The second direction—local methods—continued (after a short pause) Marr's theory and focused on improving the correspondence selection techniques and increasing the discriminability of matching-element signatures.

Global methods first need to define a prior model of the world. The proper selection of this model determines the final solution, which is why it is the key problem of these methods. The solution is found by minimizing a global function, usually consisting of two terms: one represents the world model with all its assumptions, while the other measures consistency with the input images.

The first global formulation was proposed by Cox et al. [15]. They formulated the task in a Bayesian framework solvable by a maximum-likelihood algorithm with a uniform prior probability distribution, which results in a maximum a posteriori (MAP) estimate. The solution is found via dynamic programming, separately for each scanline. Later, formulations modelling the real world more precisely (accounting for occlusions and discontinuities) were proposed in [1, 20], where the solution is again found by a classical MAP estimate leading to a dynamic programming search.

The dynamic programming algorithm is able to find the optimal solution for each scanline separately, which in general does not preserve any inter-scanline smoothness consistency. An approach incorporating an inter-scanline smoothness assumption was introduced in [47], where the intra-scanline and inter-scanline searches are combined: dynamic programming is performed to find the optimum in both "directions" simultaneously as well as for each direction separately. However, some truly global 2-D optimization still seemed to be required.

This demand resulted in graph labeling formulations of the correspondence problem, which solve the whole 2-D optimization at once. The aim is to find a labeling of the graph vertices (usually representing the image pixels) by labels (representing the disparity) that is consistent with the assumptions of the world model. Two formulations have been introduced: maximum-flow [52, 51, 30] and minimum-cut [7, 9, 33]. The key problems of these methods are (besides the proper definition of the global function) the selection of an appropriate graph structure and the scoring of its edges.

To weigh the pros and cons of global methods, we have to emphasize the clarity of their formulation of the correspondence problem. However, they also have many drawbacks. The time complexity of the global approaches is high (for the graph formulation the general problem is NP-hard [36]), and consequently the feasible approaches have to be restrictions of the original formulation. The overall success depends on the precise definition of the scene model, including all the assumptions (smoothness, uniqueness, occlusions, discontinuities), and on the balance between the model and data consistency. Typically, in low-texture image areas the scene model wins over the data, resulting in interpolation between the boundaries of these areas—the typical thin streaks. This interpolation is almost always erroneous, mainly in scenes of deep depth with thin objects in the foreground.

Local methods were supplanted by global methods at the beginning of the 1990s. However, at the end of the 1990s researchers returned to Marr's concept (due to the problems of global methods) and have been trying to improve on its ideas. They start from the assumption that no global optimization is needed; on the contrary, local image information is sufficient for a correct formulation of the matching task.

In local methods, the matching is performed by selecting the most similar matching elements. Each matching element is assigned a signature, which describes its characteristics. A statistic evaluating similarity is defined over the signatures, which allows the selection of matching pairs. The cost of a signature depends only on the local image properties (in a predefined neighbourhood/window) of the matching element. Therefore, the core of local methods lies in the definition and selection of appropriate windows—representing the signatures—and of the statistics computed over these signatures.

Typical statistics for computing matching costs are the sum of squared differences, the sum of absolute differences, normalized cross-correlation, etc. Based on these statistics, the correspondences are selected. The principle used to select matches is essential, because it directly determines the quality of the results. Consequently, it is very important to design a method on a sound logical basis, not just on some heuristic "which works well". Different principles for selecting the matches exist, such as winner-takes-all (WTA) [48, 65] or Stable matching [54, 55]. In the WTA approach, the best tentative pair (the winner) is (somehow) selected for each point in a reference image. In Stable matching, by contrast, the symmetric best unique solution is identified.
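To make the WTA principle concrete, the following is a minimal Python sketch of winner-takes-all selection along one rectified scanline; the absolute-difference cost and the toy rows are our illustrative choices, not taken from the cited algorithms.

```python
import numpy as np

def wta_scanline(left_row, right_row, max_disp):
    """Winner-takes-all matching of one rectified scanline.

    For each pixel u in the left row, every tentative pair (u, u - d),
    d = 0..max_disp, is scored and the cheapest one wins.  No uniqueness
    or symmetry is enforced, which is exactly the weakness discussed above.
    """
    n = len(left_row)
    disparity = np.zeros(n, dtype=int)
    for u in range(n):
        best_cost, best_d = np.inf, 0
        for d in range(min(max_disp, u) + 1):
            cost = abs(float(left_row[u]) - float(right_row[u - d]))
            if cost < best_cost:
                best_cost, best_d = cost, d
        disparity[u] = best_d
    return disparity

left  = np.array([10, 10, 80, 80, 10, 10, 10], dtype=float)
right = np.array([10, 80, 80, 10, 10, 10, 10], dtype=float)
print(wta_scanline(left, right, max_disp=3))
```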

In order to produce accurate results, selection methods require discriminable signatures. The accuracy of the results should improve as the discriminability of the signatures increases. Therefore, it is very important to define appropriate windows (in size and shape) for evaluating the matching costs. Problems with standard square windows arise at object boundaries (where each camera captures a slightly different scene) or at surfaces with large disparity variations. Approaches proposing adaptive windows [32, 8, 69], shiftable windows [4, 20, 49], windows based on connected constant-disparity components [6], or windows based on image segmentation [70, 67] have tried to cope with these problems.

Recent approaches have noticed that (due to various image distortions) modelling scene patches in the input images is an inconvenient direction. Instead, they propose approaches where these patches are represented in the (discrete or continuous) disparity space (image), which is the set of all tentative correspondences, and thus patches corresponding to real scene structures must exist there [65, 70, 40]. Based on this "spatial" modelling, the matching cost should be more reliable and discriminable.

Another window-related issue is incorporating the smoothness assumption into the problem formulation. Due to the local nature of these methods this is a difficult task, usually solved by a so-called "aggregation step": the matching cost is improved by contributions of points from a defined neighbourhood that corresponds, with high probability, to the same scene patch—the support region. A support region can be represented by various kinds of adaptive windows, by a limited disparity difference [24], or by a limited disparity gradient [50]. Other ways of aggregation are iterative diffusion [59] and belief propagation [63].

Local methods are in general very fast, since they work only on a small area of the image. They do not produce artifacts like the global methods; however, if the selection method is not properly formulated, they may produce spurious results. The smoothness constraint is usually not directly incorporated into the algorithm, but is enforced additionally by aggregation over support regions. The main problem lies in the definition of adequate windows for representing the signatures. This problem is never totally avoidable; it can only be suppressed by adaptive approaches.

After this survey of the historical progress in binocular stereo matching, let us now focus on a detailed description of the individual approaches. First, we introduce the elements which have been, or still are, used for matching. Then we present the most important matching algorithms.

2.1.1 Matching Elements

In order to compute correspondences between two input images, it is essential to define the elements which are to be matched. On these elements the matching cost is evaluated. Once the costs are computed, the matching problem is solved and, consequently, the disparity map can be recovered.

In stereo matching approaches, two main classes of matching elements can be differentiated: features and pixel intensities. Features are typically the easily distinguishable elements or regions in the input images, and the matching is performed only on these features. The methods using features belong to the group called feature-based stereo. On the other hand, the methods processing pixel intensity values perform matching over the pixels and usually establish correspondences in (all) pixels. They belong to the group called area-based stereo. This division is more or less historical and does not reflect the modern view well. First of all, a pixel is also a feature. Second, various features with interesting properties have been proposed in recent work on wide-baseline stereoscopic matching [44, 68]. The features need not have any typical geometric interpretation; they may be thought of as image primitives that possess the property of detection stability and have some (required) invariant properties.

We now describe the historical development of matching elements. Marr & Poggio [43] proposed an algorithm based on matching image features, because they argued that while pixel intensities can be modified easily, strong intensity changes (representing object boundaries, etc.) should be present in both images and detectable without problems. Since feature-based algorithms match only features, and consequently the resulting disparity maps are very sparse, algorithms establishing correspondences directly in every pixel were highly desired. Therefore, research concentrated on using the image pixel intensities, resulting in the so-called area-based stereo.

Feature-Based Stereo

Marr defined features at positions (in the image) where the intensity function changes sharply (edges, corners, etc.), because he argued that this intensity difference is preserved in both images, while a pixel intensity can change very easily (due to various geometric distortions, image sampling, etc.). The features are typically extracted using various interest-point detectors [10, 42, 62, 26].

The matching is performed only between the detected features. Consequently, the disparity is directly evaluated only at the feature positions. That is why the results are very sparse, which is inconvenient for binocular stereo matching purposes. In order to obtain dense disparity maps, it is necessary to apply some post-processing, such as interpolation over the smooth areas (the general assumption being cohesivity). However, how can we decide which features belong to the boundary of the same spatial object? Or what curvature should the interpolated surface have? Such post-processing can therefore produce various errors. Another problem of these methods arises in the detection of features: this procedure can easily fail and thus render the whole matching process incorrect.

For these reasons, in the 1980s research in binocular stereo vision focused on the investigation of methods producing dense disparity maps in a straightforward way. Thus, in this survey we do not describe the individual feature-based approaches, but refer to Dhond's paper [17], where a good and detailed review is presented.

Although researchers have abandoned feature-based methods for binocular matching purposes, interest in them has been revived in different research fields, such as wide-baseline stereo. The goal of wide-baseline stereo is to estimate the camera geometries from a given set of images. For this application, the sparsity of the results does not matter, and the features need not be interpreted in any way. Naturally, the feature definition differs from Marr's: e.g. in [44], the features are defined as distinguished regions—the image parts most discriminable from their neighbourhoods.


Area-Based Stereo

The matching elements of area-based methods are the individual pixels, over which the matching cost is evaluated. The methods concentrate on finding pixel-to-pixel correspondences, resulting in dense disparity maps. The matching is based directly on the image intensity function, and consequently no features need to be determined.

The matching costs are evaluated using similarity statistics. The simplest statistic is to compare just the pixel intensity values [15, 13, 14], which is very sensitive to noise and various image distortions. More robust approaches compare windows (of predefined size) containing the image neighbourhood of the processed pixels. The most common matching statistics are, for instance, the sum of squared differences (SSD) [14, 49, 65], the sum of absolute differences (SAD) [23, 28], normalized cross-correlation [45, 11, 55], or rank methods [2].
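For illustration, a small Python sketch of the three window statistics; the window extraction, the toy images, and the epsilon guarding the NCC denominator are our own choices, not prescribed by the cited papers.

```python
import numpy as np

def window(img, r, c, half):
    """Extract a (2*half+1) x (2*half+1) window centred at (r, c)."""
    return img[r - half:r + half + 1, c - half:c + half + 1].astype(float)

def ssd(w1, w2):
    return np.sum((w1 - w2) ** 2)

def sad(w1, w2):
    return np.sum(np.abs(w1 - w2))

def ncc(w1, w2, eps=1e-9):
    a = w1 - w1.mean()
    b = w2 - w2.mean()
    return np.sum(a * b) / (np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)) + eps)

# toy example: compare a 3x3 window of the left image with a candidate in the right image
left  = np.random.default_rng(0).integers(0, 255, (20, 20))
right = np.roll(left, -2, axis=1)          # right image = left image shifted by 2 pixels
wl = window(left, 10, 10, 1)
wr = window(right, 10, 8, 1)               # correct correspondence (disparity d = 2)
print(ssd(wl, wr), sad(wl, wr), ncc(wl, wr))   # SSD/SAD minimal, NCC maximal here
```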

Area-based methods have a disadvantage over feature-based methods in that they compare the intensity function at pixels, which can vary due to various distortions (even at corresponding positions). The main problem causing artifacts tends to be image sampling. Elimination of this factor is discussed in the following paragraphs.

Image sampling problem
Due to image sampling, the simple evaluation of the matching cost over a defined window can assign a bad value to two corresponding pixels, even in images containing no degrading effects. A few papers [3, 65, 12] addressing this problem have been published.

In [3], Birchfield & Tomasi proposed a method insensitive to image sampling. The defined dissimilarity measure uses the linearly interpolated intensity functions surrounding the pixels in both input images. The matching costs are evaluated at the integer pixel positions and at the half-pixel positions, where the image function is interpolated, symmetrically to each other. The best of these costs is assigned as the resulting matching cost.
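The following Python sketch is written in the spirit of this sampling-insensitive measure; the boundary handling and the toy signals are our assumptions and may differ in detail from the published definition in [3].

```python
import numpy as np

def one_sided(a, xa, b, xb):
    """Distance of sample a[xa] from the linearly interpolated signal b
    over the half-pixel interval around position xb."""
    b_mid_l = 0.5 * (b[xb] + b[xb - 1])
    b_mid_r = 0.5 * (b[xb] + b[xb + 1])
    b_lo = min(b[xb], b_mid_l, b_mid_r)
    b_hi = max(b[xb], b_mid_l, b_mid_r)
    return max(0.0, a[xa] - b_hi, b_lo - a[xa])

def sampling_insensitive(left, xl, right, xr):
    """Symmetric dissimilarity of left[xl] vs right[xr]: the smaller of the
    two one-sided distances, so a half-pixel sampling shift costs nothing."""
    return min(one_sided(left, xl, right, xr), one_sided(right, xr, left, xl))

left  = np.array([10., 20., 60., 100., 100.])
right = np.array([10., 40., 80., 100., 100.])   # a similar ramp, sampled slightly differently
print(abs(left[2] - right[2]))                  # plain pixel difference: 20
print(sampling_insensitive(left, 2, right, 2))  # sampling-insensitive cost: 0
```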

A method inspired by [3] has been proposed by Szeliski & Scharstein [65]. They also interpolate the input images in order to obtain a continuous function (and thus decrease, or eliminate, the sensitivity to sampling). Unlike [3], the matching cost is evaluated in a continuous disparity space image (DSI).

The most recent approach has been presented by Clerc [12]. This method is not based on any image interpolation, but represents the continuous image signal by its wavelet coefficients computed at each image pixel. Consequently, the matching cost is evaluated on the resulting coefficients. This method allows even repetitive structures to be matched correctly, which indicates a promising direction for further research.

2.1.2 Matching Algorithms

The core of computational stereopsis lies in establishing the matching between matching elements. The quality of the results is directly determined by the matching algorithm used to solve this problem. Due to our interest in dense stereo approaches, we focus on algorithms using the individual pixel intensities as matching elements.


The matching algorithms can be classified into the following two main groups: global methods and local methods. Global methods optimize some global (energy) function, while local methods use only small areas/neighbourhoods surrounding the pixels. Besides these two main categories, there exist various iterative or cooperative algorithms. These algorithms usually operate only over small areas, but they compute a global improvement of these areas. Since the principle of these methods rests on "areas" and their processing, we classify them in the local methods category.

In the following paragraphs, we introduce the two main groups—global and local methods—separately, together with a detailed description of the significant approaches.

Global Methods

The core of global approaches lies in a correct definition of the scene model. The matching problem is usually formulated as an energy-minimization problem, where the energy function is defined as:

E(d) = Edata(d) + Esmooth(d), (3)

and the goal is to find a disparity function d that minimizes this energy. The data term, Edata(d), measures how well the disparity function corresponds to the input image pair:

Edata(d) = Σ_{p∈I} M(p, d(p)),   (4)

where p are pixels of the input image and M(p, d(p)) is the matching cost function.

The smoothness term, Esmooth(d), incorporates the smoothness assumptions defined by the scene model. It usually penalizes differences between the disparities of neighbouring pixels,

Esmooth(d) = Σ_{{p,q}∈N} V_{p,q}(d(p), d(q)),   (5)

where p, q are image pixels, N is the set of pairs of adjacent pixels representing the neighbourhood relation, and V_{p,q}(d(p), d(q)) = |d(p) − d(q)| measures the disparity dissimilarity.

The smoothness term "forces" the resulting disparity surface to be smooth, while the data term "preserves" the correspondence with the input images. Mutual balancing of these two terms is crucial for the accuracy and correctness of the resulting disparity map.
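A hedged Python sketch of evaluating Eqs. (3)-(5) for a candidate disparity map follows; the absolute-difference cost M, the 4-connected neighbourhood N, the out-of-range penalty, and the weighting coefficient lam (equal to 1 for the plain sum of Eq. (3)) are our illustrative choices.

```python
import numpy as np

def energy(disp, left, right, lam=1.0):
    """Evaluate E(d) = E_data(d) + lam * E_smooth(d), cf. Eqs. (3)-(5).

    E_data sums a per-pixel matching cost M(p, d(p)) (absolute intensity
    difference here); E_smooth sums V(d(p), d(q)) = |d(p) - d(q)| over
    4-connected neighbour pairs.
    """
    h, w = disp.shape
    e_data = 0.0
    for y in range(h):
        for x in range(w):
            xr = x - disp[y, x]                  # matched column in the right image
            if 0 <= xr < w:
                e_data += abs(float(left[y, x]) - float(right[y, xr]))
            else:
                e_data += 255.0                  # out-of-range match: fixed penalty (our choice)
    # smoothness: |d(p) - d(q)| over horizontal and vertical neighbour pairs
    e_smooth = np.sum(np.abs(np.diff(disp, axis=0))) + np.sum(np.abs(np.diff(disp, axis=1)))
    return e_data + lam * float(e_smooth)

left  = np.random.default_rng(1).integers(0, 255, (5, 8)).astype(float)
right = np.roll(left, -2, axis=1)
good  = np.full((5, 8), 2, dtype=int)            # constant disparity 2 (the true shift)
noisy = good.copy(); noisy[2, 3] = 5
print(energy(good, left, right), energy(noisy, left, right))   # the noisy map costs more
```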

Once the energy function has been defined, various algorithms can be applied to find its minimum. Some methods find the minimum separately for each scanline (via dynamic programming), and thus minimize a 1D function [15, 1, 20]. Other methods (graph formulations) directly minimize the 2D function representing the disparity surface [52, 30, 9, 33].

In the following paragraphs, we introduce the two main groups of minimization methods: the first based on max-flow or graph-cut formulations, and the second on dynamic programming.


Graph-cut and Max-flow formulations
Methods of this class formulate the matching problem as a graph labeling problem. The aim is to find a consistent labeling (representing disparities) of the given graph (representing image pixels) under the assumptions of the world model. However, solving this problem in general is NP-hard [34]. Nevertheless, there exists a special class of formulations solvable in polynomial time [36].

Two widely used feasible formulations of this problem are max-flow and min-cut. Although they are dual problems and max-flow is usually solved by finding a minimal cut in a graph, they define the graph structure differently and consequently the resulting min-cut as well. In general, a cut of a graph divides its vertices into disjoint subsets (one subset per terminal) whose union is the complete set of graph vertices. The graph cut consists of the graph edges connecting the disjoint subsets. The costs associated with the edges of the cut contribute to the total cut cost. The goal of minimum-cut problems is to find a cut with minimal cost.

The max-flow formulations think of the correspondence problem in this way: a spatial disparity surface is sought in a disparity space, where for each image pixel (x, y) the third coordinate represents all plausible disparities. Consequently, the graph is usually defined as follows: vertices correspond to a 3-D mesh grid representing the disparity space image, where the vertices are 6-connected (enforcing the smoothness constraint), while the edges represent the selected model structure. The costs associated with individual edges stand for the energy function terms. The "best" disparity surface should be found at the positions of the weakest edges. Therefore, the problem is formulated as finding a min-cut: a sink s and a source t are added to the graph vertices, the sink is connected to the first disparity level and the source to the last disparity level. On this graph, the minimal cut dividing its vertices into two subsets S and T, where T = V \ S, s ∈ S and t ∈ T, is computed. The disparity corresponding to the position of the min-cut is assigned to each pixel as its resulting disparity.

The min-cut formulations define the problem as a typical graph labeling. Given a set of pixels and a set of labels (representing plausible disparities), the problem is: which label should be assigned to each pixel to keep the solution consistent with the prior assumptions? Consequently, the graph structure is as follows: vertices correspond to image pixels and to labels. Pixels are connected to all labels, and the neighbouring structure is preserved between the pixels. The edges represent the selected model. The costs associated with individual edges describe the energy function terms. The algorithm, however, must be restricted to a special case (due to the complexity of the general task). The feasible algorithms find the solution iteratively, where in each step only two labels are taken into account; in each iteration a different pair of labels is examined. The aim is to iteratively reorganize the assignment of labels to pixels in order to decrease the global function cost. The min-cut is used in each iteration to find a new reorganization of the selected labels' assignment—each pixel is assigned the label from which it has been separated by the cut.

Let us now focus on the individual approaches.

The first maximum-flow formulation of the stereo matching problem was proposed by Roy et al. [52, 51]. The graph is constructed as in the general max-flow formulation: a 3D mesh of (x, y, d) points, where (x, y) are image coordinates and d a plausible disparity, plus a source and a sink. Internally the mesh is 6-connected; the source is connected to the front plane, while the back plane is connected to the sink. The costs associated with the edges represent the energy function. In the energy function, the balance between infinite smoothness and maximal discontinuity is controlled by a smoothness parameter. On this graph, the min-cut separating the sink and the source is found. The depth map is constructed from the min-cut as follows: for each point (x, y), the largest disparity whose associated edge belongs to the min-cut is selected.

Ishikawa & Geiger have presented in [30] another max-flow formulation algorithm. Theapproach guarantees ordering and uniqueness constraints. In order to explicitly model dis-continuities and occlusions, the model includes geometric constraints which require mutualcorrespondence between disparity discontinuity along the epipolar line in one image andoccluded region in the other image. The constructed graph is a slight modification of ageneral definition: a 3D mesh of doubled all possible matches (to model discontinuitiesand occlusions) (x, y, r), where, assuming the rectification, (x, r) are coordinates of a pixelin the left image, while (y, r) in the right image, and the source and the sink. The edgesare oriented and represent the terms of defined model—including matching, occlusion, dis-continuity, and constraint edges. The model encourages smoothness across epipolar lines.The smoothness coefficients are adjusted according to the edge and junction information.The solution is find by min-cut, which separates the source and the sink. The optimalmatching corresponds to matching edges from min-cut resulting in the final disparity map.

The min-cut formulation of the matching problem has been published by Boykov et al. [7, 9]. In the former work, they defined the energy function based on Markov Random Fields (MRF) with the Potts energy, which is able to preserve discontinuities; in the latter they proposed two operations applied to the graph iteratively, leading towards the optimal solution. The graph is constructed in the general way: vertices coincide with the image pixels and with the set of all labels (plausible disparities). The neighbourhood relation between pixels is preserved by n-link edges, and pixels are connected by t-link edges to those labels which, according to the selected model, seem to be possible solutions. The costs assigned to the individual edges correspond to the energy function terms. The aim is (as in the general case) to find a min-cut on the defined graph which separates all the terminals (labels). Due to the NP-completeness of this task, they propose an algorithm which finds only an approximation of the solution: it proceeds iteratively, and in each iteration only two (selected) labels are examined. The first proposed operation is the α-β swap move, where only the pixels currently assigned the labels α or β are taken into account, and the min-cut reassigns these labels among the examined pixels so that the energy function value decreases. The second operation is the α-expansion move, where the label α is assigned to as many pixels as possible under the condition of decreasing the energy function value. Again, the new labeling is found via a min-cut solution. Kolmogorov [33] has developed an extension of Boykov's approaches. He defined an improved energy function in which occlusions are explicitly represented. The approximate minimum of the energy function is found using the α-expansion move, adopted entirely from Boykov's approach.


Dynamic programming formulations
While the 2D energy function minimization (Eq. 3) is NP-hard for common classes of functions [34], dynamic programming can find the global minimum in polynomial time, but only for each scanline independently. The optimization problem is formulated as computing the minimum-cost path through the matrix of all pairwise matching costs between two corresponding scanlines. Occlusions are modeled by assigning a group of pixels in one image to a single pixel in the other image and penalizing the solution by an occlusion cost.
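A hedged Python sketch of this scanline formulation is shown below: the minimum-cost path through the matrix of pairwise costs, with unmatched pixels penalised by a fixed occlusion cost. The cost values are illustrative, not those used in [15, 1, 20], and only the optimal cost is returned (backtracking the same table would recover the matches and occlusions).

```python
import numpy as np

def dp_scanline(left_row, right_row, occ=20.0):
    """Dynamic programming over one rectified scanline.

    C[i, j] is the minimum cost of matching the first i left pixels against
    the first j right pixels; a step can match left[i-1] with right[j-1]
    (cost = absolute difference) or leave either pixel occluded (cost = occ).
    """
    n, m = len(left_row), len(right_row)
    C = np.full((n + 1, m + 1), np.inf)
    C[0, :] = occ * np.arange(m + 1)
    C[:, 0] = occ * np.arange(n + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = C[i - 1, j - 1] + abs(float(left_row[i - 1]) - float(right_row[j - 1]))
            C[i, j] = min(match, C[i - 1, j] + occ, C[i, j - 1] + occ)
    return C[n, m]          # backtracking the argmins of the same recursion yields the matches

left  = np.array([10, 10, 80, 80, 10, 10], dtype=float)
right = np.array([10, 80, 80, 10, 10, 10], dtype=float)
print(dp_scanline(left, right))
```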

The problems of dynamic programming formulations are the selection of the right occlusion cost, the difficulty of enforcing inter-scanline consistency [4, 47], and the requirement of enforcing the monotonicity or ordering constraint.

In Cox's works [15, 13, 14], a maximum-likelihood algorithm that enforces the uniqueness and ordering constraints is presented. The local cost function consists of the matching cost evaluated on individual pixels and a fixed penalty term for unmatched pixels. The matching problem is solved for each scanline independently via dynamic programming. The experiments in [15] revealed that multiple global minima of the cost function exist, which led to an additional assumption, the cohesivity constraint, being enforced in the more recent works [13, 14]. The algorithm recovers the globally smoothest solution, i.e. the solution with the least number of discontinuities in both the horizontal and vertical directions.

Geiger et al. [20] have proposed a method based on a Bayesian approach using prior piecewise-smoothness assumptions and the uniqueness and monotonicity constraints, which explicitly models occlusions. They made the observation that a disparity discontinuity along the epipolar line in one input image always corresponds to an occluded region in the other image (and vice versa), leading to an additional assumption: the occlusion constraint. The matching cost is evaluated over shiftable windows, allowing adaptation to variations in intensity values and location. The optimal disparity function is obtained by the maximum a posteriori estimate, using dynamic programming.

Belhumeur [1] has developed a computational model based on a Bayesian framework in which the occluded regions are explicitly represented and computed. In the paper, various scene models are derived, from simple configurations to complex scene models. He argues that a reliable model should include not only depth, but also discontinuities in depth at object boundaries, surface orientation, and surface creases. The defined prior model is used to compute the optimal disparity function by the maximum a posteriori estimate. A new dynamic programming strategy, which simultaneously recovers depth, occluded regions, surface orientation and surface creases, is presented.

Bobick et al. [4] have presented a stereo algorithm which finds matches and occluded regions simultaneously, while no smoothness or consistency constraints are incorporated. They introduce a data structure called the disparity space image (DSI), through which a dynamic programming approach finds the best path. The sensitivity to the occlusion cost and the algorithmic complexity are reduced by using highly reliable matches—ground control points—which guide the search for the solution. The method is additionally extended to exploit the relationship between occlusion gaps in the DSI and intensity edges, resulting in a reduction of the cost of occlusion edges which coincide with intensity edges. The approach allows even large occlusion regions to be extracted accurately.


Local Methods

Local methods do not optimize any global function; they compute the disparity map based on image areas (neighbourhoods). The final disparity at each pixel is determined only by the area surrounding this pixel, which is what makes these methods local.

Assuming area-based stereo, each pixel is tagged with a signature. This signature describes the pixel, and the aim is to define signatures such that pixels can easily be distinguished from one another (based on these signatures). In general images this is impossible, due to various corruptions, and consequently the goal is to define signatures that are as discriminable as possible.

The typical definition of a signature is the image neighbourhood (a window of predefined size) of the evaluated pixel. Therefore, the proper window selection is crucial, because it directly influences the signature discriminability. Many approaches have addressed this topic, and we describe them in more detail below.

Over the signatures, a matching statistic evaluating the matching costs is defined. Typical definitions are: the sum of squared differences (SSD), the sum of absolute differences (SAD), or normalized cross-correlation (NCC). On the computed pixel costs, the matching algorithm is applied in order to establish the correspondences. In general, the disparity associated with the best matching cost (minimal for SSD and SAD, maximal for NCC) is selected as the final disparity for each pixel. The precise definition of the matching algorithm is very important, for otherwise it could cause totally spurious results.

The classical local matching algorithm is the so-called winner-takes-all (WTA) [48, 65], which selects as the match the pair with the best matching cost under the assumption of uniqueness. On one hand, this algorithm establishes the best matching solution; on the other hand, it may produce sparse results and does not guarantee the quality of the matching. In a recent publication, a more rigorous matching algorithm, Confidently-Stable matching (CSM) [55], has been introduced. This approach is based on stable matching [54], which draws on the stable marriage problem [25]. Stable matching is based on a new property, the stability of the solution, which ensures the best unique solution. The CSM algorithm is able to identify the largest unambiguous matching at a given confidence level.
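The stability property itself is beyond a short sketch; the Python example below only illustrates the weaker, related idea of keeping symmetric mutual-best pairs (a left-right consistency check), which already rejects many pairs a plain WTA would accept. It is our simplified stand-in, not the CSM algorithm of [55].

```python
import numpy as np

def symmetric_matches(cost):
    """Keep only pairs (u, v) that are mutually the best for each other.

    cost[u, v] is the matching cost of left pixel u against right pixel v
    on one scanline (lower is better).  Plain WTA would keep the row-wise
    winners; here a pair survives only if it also wins column-wise.
    """
    best_for_left = np.argmin(cost, axis=1)      # WTA from the left image
    best_for_right = np.argmin(cost, axis=0)     # WTA from the right image
    matches = []
    for u, v in enumerate(best_for_left):
        if best_for_right[v] == u:               # symmetric agreement
            matches.append((u, int(v)))
    return matches

rng = np.random.default_rng(2)
cost = rng.random((6, 6))
print(symmetric_matches(cost))                   # typically fewer than 6 pairs survive
```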

Due to the locality of local methods, the general assumption of piecewise scene smoothness can hardly be incorporated. The methods solve this problem by aggregating the matching cost over an area, the so-called support area, where the final matching cost is computed by summing or averaging. The various ways of aggregation are discussed further below.

Some approaches try to improve the matching costs by applying iterative algorithms. These methods use windows for evaluating costs; however, the iterative improvements are performed over global areas. These methods are called cooperative algorithms, and we describe them at the end of this section.

Proper window selection problem
The proper selection of the windows' shape and size is crucial for the overall success of local methods. The windows must be large enough to capture the intensity variation needed for reliable matching, and at the same time small enough to avoid the effects of projective distortions. Moreover, an appropriate window selection should improve the signature discriminability.

Due to geometric distortions, rectangular fixed-size image windows having the pixel of interest at the centre may run into difficulties. Problems arise at object boundaries or surface orientation discontinuities, where the depth of the scene (and thus also the disparity) varies. The windows then cover non-corresponding parts of the images (even if the pixels of interest correspond, cf. Fig. 4), and therefore the signatures become incomparable as well.

Various approaches have been published that try to cope with this problem: shiftable windows, windows with adaptive size, windows using image segmentation, windows based on connected constant-disparity components, or windows modeled in disparity space. We introduce each category separately.

Shiftable windows. These methods modify neither the size nor the shape of the windows. They shift the position of the window over the examined pixel (resulting in a non-centred location of the pixel in the window) in order to cover more nearly corresponding parts of the input images. The signature tagging the pixel is the window at the best matching position. Consequently, the signatures obtained by this process should be more reliable.

The relevant approaches have been published in [4, 20, 49]. Even though the first two solve the matching problem via dynamic programming (i.e. they are global methods), they pay attention to improving the matching costs as well. Geiger et al. [20] use windows shifted into two different positions (left and right), while Bobick et al. [4] propose a method where nine window positions (centre-middle, centre-left, centre-right, upper-middle, upper-left, upper-right, lower-middle, lower-left, and lower-right) are examined.

The latter approach, designed by Okutomi et al. [49], belongs fully to the local methods. They try to handle the problem of boundary overreach by proposing a shiftable-window solution. A 5 × 5 window is moved into nine positions (as in Bobick's work) and the position with the lowest SSD is selected. Shiftable windows allow object boundaries to be detected more precisely; however, they tend to "destroy" the smoothness of disparity surfaces. Consequently, they propose a method where the shiftable windows are applied only in regions of possible boundary overreach, which are explicitly detected. In the rest of the image, standard windows are used.
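A hedged Python sketch of the nine-position shiftable window follows: the 5 × 5 window keeps its size but is displaced so that the examined pixel may lie off-centre, and the position with the lowest SSD is kept; the offsets and border handling are our own choices.

```python
import numpy as np

def shiftable_ssd(left, right, r, c_left, c_right, half=2):
    """Lowest SSD over nine shifted 5x5 window positions around a tentative pair.

    The same offset is applied in both images, so the examined pixels
    (r, c_left) and (r, c_right) stay inside the window but need not be
    at its centre.
    """
    best = np.inf
    for dr in (-half, 0, half):                  # upper / centre / lower window positions
        for dc in (-half, 0, half):              # left / middle / right window positions
            r0, cl0, cr0 = r + dr, c_left + dc, c_right + dc
            wl = left[r0 - half:r0 + half + 1, cl0 - half:cl0 + half + 1].astype(float)
            wr = right[r0 - half:r0 + half + 1, cr0 - half:cr0 + half + 1].astype(float)
            if wl.shape == (2 * half + 1, 2 * half + 1) and wl.shape == wr.shape:
                best = min(best, float(np.sum((wl - wr) ** 2)))
    return best

left  = np.random.default_rng(3).integers(0, 255, (30, 30))
right = np.roll(left, -4, axis=1)
print(shiftable_ssd(left, right, r=15, c_left=15, c_right=11))   # correct pair: SSD 0
```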

Windows with adaptive size. Approaches using adaptive windows try to cope with the problem of a proper window size. In texture-less areas the windows should be large (to capture the intensity variations), whereas at depth discontinuities the windows should be small (to avoid covering different scene parts). Consequently, the proposed solutions typically enlarge or shrink the window size according to the examined area.

Kanade & Okutomi [32] have proposed a method, where the appropriate window isselected by evaluation the variations in intensity and disparity. The assumption is, that atdiscontinuities the intensity as well disparity variations are larger, unlike at the positions

15

Page 19: CENTER FOR MACHINE PERCEPTION Stereoscopic Matching ...cmp.felk.cvut.cz/ftp/articles/kostkova/Kostkova-TR-2002-13.pdf · The field of stereo matching focuses exactly to the task

on the surface of the scene object. They developed a statistical model of the disparitydistribution within the window enabling to monitor the impact of disparity and intensityvariations on the uncertainty of disparity estimate at the centre point of the window. Start-ing with the initial disparity estimate, the algorithm iteratively updates the disparity foreach pixel by choosing the shape and size of the window. The window is enlarged stepwisein all four directions. The iteration is stopped when the disparity estimate uncertainty atthe centre pixel of the window converged.

Windows using image segmentation. In order to ensure that the windows cover relevant parts of the images, researchers have tried to bring in object position information via segmentation. The images are segmented based on the intensity function. This method guarantees that a window will not overreach the segment boundary. However, who ensures that the segmentation itself is correct and error-free (e.g. in the presence of shadows)? Moreover, enforcing such external information compromises the purity of the algorithm.

Tao et al. [67] presented an approach incorporating image segmentation. First, the disparity having the best matching cost over all possible matching pairs is selected as the initial disparity for each pixel. Then, the reference image is segmented. To each segment, the disparity corresponding to the average of the initial disparities within the segment is assigned. The second image is predicted from the reference image using the depth information. The matching is performed based on a global match measure, defined as the similarity between the predicted and the second image. The goal is to find the depth map maximizing the global match measure. The segmentation allows a better representation of the depth information; however, it can also introduce various mistakes.

In [70], a cooperative algorithm employing image segmentation is proposed. We describe the whole algorithm in the section dedicated to cooperative algorithms; here we emphasize the segmentation-based step. The segments, computed from the intensity function, are divided into local parts—patches—where the disparity is assumed to be constant. The patches define the support regions. The matching costs are improved based on neighbouring patches within a segment. The segmentation ensures that the support area coincides with the segment boundary. However, spurious results may be produced due to wrong segmentation (which can be observed in the experiment presented in the paper).

Windows based on connected constant-disparity components. Boykov et al. [6] published an approach where the windows are defined to cover the parts of the scene with the same disparity—constant-disparity components.

In [6], the set of plausible disparities is computed for each pixel. Using the neighbourhood relation, the disparity components are created by connecting the pixels having the same plausible disparity. As the resulting disparity for each pixel, the disparity belonging to the largest disparity component is selected. The proposed algorithm tends to create a piecewise-constant disparity map, resulting in problems with slanted objects.


Windows modeled in disparity space. In recent approaches, researchers have noticed that, due to the geometric distortions caused by different camera positions, one cannot model the windows in the input images and expect them to cover the same part of the scene. Consequently, they have focused on proposing a method (or structure) in which the real scene can be modeled, resulting in the disparity space (image)—a 3-D structure consisting of all the possible matching pairs.

Since the disparity space contains all the possible matching pairs, the disparity surface can be modeled there. Consequently, if we let the windows adapt to the structures of the disparity surface, this ensures that they cover the same part of the scene. Accordingly, the discriminability of the evaluated signatures should improve.

Szeliski & Scharstein [65] proposed a method for evaluating the matching cost in thedisparity space. They defined the disparity space image (DSI), which is a 3-D structure of(x, y, d) points, where (x, y) represents a pixel in the input left (reference) image, and d allthe possible disparities. In order to avoid the perspective distortions, the image functionsare interpolated resulting in the continuous DSI. Such a definition allows to evaluate thematching costs more precisely and consequently it improves the matching results.

Matching cost aggregation support

Due to the natural impossibility of incorporating the smoothness assumption into window-based approaches, some local methods enforce the smoothness constraint by aggregation over support regions. The support region is the area surrounding the examined pixel which, with high probability, belongs to the same surface patch. The final matching cost is aggregated by summing or averaging over these regions. The support regions can be represented in many ways. The most typical definitions use various kinds of adaptive windows, discussed in the previous paragraphs. Other approaches proposed regions limited by disparity difference [24] or by disparity gradients [50].

Pollard, Mayhew and Frisby [50] have proposed a method where the support region is defined based on the disparity gradient limit constraint. The disparity gradient between two points is the difference in their disparities divided by their cyclopean separation. This definition was derived from the observation that the disparity gradients between correct matches are small almost everywhere, unlike those of spurious correspondences (a consequence of the cohesivity assumption). In the PMF approach, the matching process has two steps. First, the matching cost of each potential match is computed as a sum of weighted contributions received from all potential matches in its support region. Then, the symmetrically highest matching cost pairs are iteratively accepted as correct.
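The quantity driving the PMF support weighting can be written down directly. A minimal sketch (the coordinate conventions, i.e. disparity d = x_left − x_right and the cyclopean position as the average of the two image positions, are our assumptions, not taken from [50]):

import math

def disparity_gradient(xl_a, xr_a, y_a, xl_b, xr_b, y_b):
    # Disparity gradient between two matches on rectified images.
    # Each match pairs a left column xl with a right column xr on row y;
    # the disparity is xl - xr and the cyclopean position is the average
    # of the two image positions.
    d_a, d_b = xl_a - xr_a, xl_b - xr_b
    cyc_a = ((xl_a + xr_a) / 2.0, y_a)
    cyc_b = ((xl_b + xr_b) / 2.0, y_b)
    separation = math.hypot(cyc_a[0] - cyc_b[0], cyc_a[1] - cyc_b[1])
    return abs(d_a - d_b) / separation if separation > 0 else float("inf")

# In a PMF-style scheme, a potential match contributes support to another one
# only if their disparity gradient stays below a chosen limit.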

Other ways of aggregation are iterative diffusion [59] or belief propagation [63]. These approaches do not define any “window-based” support region; the aggregation is performed by iteratively improving the matching costs based on their neighbours.

Scharstein & Szeliski [59] have proposed a method where the aggregating support is based on non-linear diffusion at different disparity hypotheses. The prior model is defined in a Bayesian framework, where the MAP estimate is computed by iterative diffusion. The process is controlled by the current quality of the disparity estimate.

In [63], Sun et al. proposed an approach based on a Markov network, where discontinuities and occlusions are explicitly modeled. The discontinuity and occlusion processes are substituted by robust functions in order to apply maximum a posteriori estimation. The MAP solution is obtained by a Bayesian belief propagation (BP) algorithm. BP iteratively improves the belief at the given nodes based on their neighbourhoods. The disparity map is constructed by selecting the nodes with the highest belief.

Cooperative Methods

Cooperative algorithms are inspired by computational models of human stereo vision. They iteratively perform local computations using non-linear operations that support the local costs, which results in an overall global behaviour similar to that of global optimization algorithms. These methods can be regarded as a conjunction of local and global methods.

Zitnick & Kanade [71] proposed an iterative algorithm which is a follow-up to Marr's approach [43]. They defined a disparity space as a 3-D structure of (x, y, d) points representing, for each pixel (x, y) in a reference image, all the plausible disparities d. In this structure, two different areas are defined: a support area representing the smoothness constraint, and an inhibition area representing the uniqueness constraint, which drives the iteration process. The algorithm starts by creating the disparity space and computing, for all its points (matches), the initial matching costs based on the image intensity functions. In each iteration, the matching costs are improved based on the support area, while the competitors (the pairs in the inhibition area) are suppressed. This process is repeated until the uniqueness constraint is fulfilled. The resulting disparity map is constructed explicitly from the disparity space: for each pixel the match with the highest cost is selected. If this cost is higher than a given threshold, the corresponding disparity is accepted; otherwise the pixel is assigned to the occlusion region. It has to be mentioned that the described method is based only on a heuristic design and does not produce good quality results. Moreover, the method does not define any optimality criterion, so the results are not optimal in any sense.

A recent cooperative approach has been published by Zhang & Kambhamettu [70]. Although their algorithm fully adopts the ideas of Zitnick's approach [71], it extends them in two ways: first, a method is designed that guarantees that correct matches have high matching costs; second, the support regions are defined based on image segmentation. Having computed the initial matching costs in the disparity space, they are adjusted so that correct matches produce high costs. Due to noise or various image distortions, even correct matches could produce low matching costs; since cooperative algorithms improve the matching costs based on local support, they require the correct matches to have high initial costs. Based on the segmentation, the image is divided into local patches (constant-disparity parts of a segment), and the matching cost of the ground control points (highly reliable matches) is assigned to the whole patch. The image patches define the support regions, and the matching cost improvement is performed based on neighbouring patches within the segment. Enforcing the ground control points guarantees more reliable initial matching costs, while the image segmentation ensures that the support areas do not overreach the segment boundary. However, there is no assurance that the segmentation is correct, which can drive the solution into completely spurious results.


2.2 Stereo Algorithms Evaluation

Although there exists an enormous number of approaches to stereo matching (described in the previous paragraphs), very few projects dedicated to their evaluation have been proposed. Due to the decreasing interest in feature-based methods, we put the emphasis on dense area-based binocular stereo matching. There are two main classes of algorithm evaluation methods.

Class 1 methods do not use any ‘ground-truth’ information; they are based on a sequence of images. They use predicted matches obtained from an image subsequence and evaluate their mutual consistency [41] or validate them in independent images [64].

The self-consistency method [41] works by first dividing the image set into overlapping subsets. The matching problem is solved for each subset and the correspondence information is used to reconstruct a spatial point. A statistic is defined on mutual distances of points reconstructed from different subsets. Points are uniquely identified by their projected coordinates in an image common to the subsets.

The prediction error method [64] partitions the set of images into prediction and validation sets. Matches from the prediction set are transferred to validation images and residual disparity vectors are computed for each predicted point. The prediction error is a statistic defined on the combination of image similarity and residual disparity vector magnitudes. This method may fail to recognize structural errors due to repetitive patterns in the images.

The Class 1 methods are suitable when the data is given and the experimental procedure cannot be re-designed. They can work with very complicated scenes where ground truth would be impossible to obtain. Complicated scenes, on the other hand, do not allow distinguishing errors due to half-occlusions, repetitive structures, low signal-to-noise ratio, surface non-Lambertianity, and systematic errors on occlusion boundaries. Besides these reasons, there is one more, probably the most limiting, objection to using these methods for evaluation: even if the prediction ability for the images is good, the disparity map can be completely erroneous [22].

Class 2 methods are based on ground-truth. Ground-truth is either obtained from independent measurement (using a range-finder [46], ground control points [21] or a digital elevation model [16]) or semi-manually with the help of a strong prior model (e.g., piecewise planarity). Recently, the most often used data of the second type is the Lab Scene from the University of Tsukuba [57, 58]. An overview of less often used datasets is given in [64].

Ground truth from independent measurement must be obtained by a method at least an order of magnitude more accurate than stereo, which is not always possible. In principle, complicated scenes can be measured as long as their complexity does not hinder the accuracy of the independent method. Semi-manual ground truth must be obtained from independent images, not from the data set itself. Nine images were used for the Lab Scene [57], for instance.

There have been several medium to large-scale efforts to evaluate stereo matching algorithms in a systematic way [37, 29, 5, 66, 61].


In [37], ten different stereo algorithms were re-implemented and evaluated. Their comparison was based on the number of correctly matched pixels and thus only measures the overall quality of the matching. The choice of the test images is neither focused on any particular application nor motivated or documented.

In [29], the performance evaluation was focused on a cartographic feature extraction application, which strongly influenced the test data selection. Only two algorithms based on different matching techniques (an area-based and a feature-based approach) were tested.

For a long time, the JISCT effort [5] remained an unsurpassed evaluation study. It is also application oriented (motivated by the development of a stereo vision module for an unmanned ground vehicle) but methodologically very advanced. On a large set of different stereo images (44) from real complex scenes, various approaches from different groups (INRIA, SRI, Teleos) were statistically evaluated based on three types of errors: false negatives, false positives and mismatches (using our terminology). Ground-truth information was provided manually. A certain weakness of this study is that the ground truth is only partial and that the method is not focused on error mechanisms.

The first attempt at creating an evaluation methodology for contemporary stereo algorithms has been published in [66]. Two different evaluation methods, comparison with ground truth and prediction error [64], were applied to a few stereo algorithms (area-based correlation methods, MLMHV [14], graph cuts [9], a cooperative algorithm [71]). This study does not propose a precise evaluation of stereo algorithms; however, it was employed as the core of a larger evaluation effort [61].

The nowadays widely used evaluation study is [61]. The authors selected four different stereo scenes to establish an evaluation test set. The evaluation methodology is based on ground-truth comparison. Two different statistics, the percentage of bad matching correspondences and the root-mean-squared error, have been proposed. Texture-less, occluded and depth-discontinuity regions have been selected in order to support a detailed comparison of the algorithms. The authors implemented a few stereo algorithms and published the results and a taxonomy of these algorithms, together with free evaluation code, at their evaluation web pages [60]. Other stereo researchers are asked to run their algorithms on the test set and to contribute their evaluated results to the taxonomy of current stereo approaches. This study is a very well designed methodology for the overall ranking and comparison of stereo algorithms. Nevertheless, it does not offer the ability to study the algorithms' behaviour in detail in order to discover their specific weaknesses. First, it does not give any mechanism for parameter setting. The authors therefore set the parameters of their methods ad hoc (in the worst case for each experiment separately), which leads to non-comparable results. Second, the half-occluded regions are excluded from the considered errors; thus the experiments do not say anything about how well the algorithms handle occlusions. Third, although the texture-less regions are monitored, the experiments have no potential for detailed algorithm investigation in these areas. The texture-less regions are only segmented from the input images, so the experiments can only say whether correspondences are or are not established there; no possibility of studying the improvement as a function of varying signal-to-noise ratio is offered. Fourth, the selected scenes are not very suitable for examining the algorithms' properties. Three of them consist of objects/planes slanted at different angles with various textures. The last one, the Lab scene, on the other hand represents a very complex and difficult scene. On the contrary, there is no scene dedicated to repetitive patterns, and consequently it is impossible to study algorithm failure due to structural ambiguity at all.

3 Proposed Methods

This section introduces our approaches to matching [40, 39] and to evaluation [54, 39], respectively.

3.1 Disparity Component Matching

Our approach belongs to local area-based stereo methods; the matching is performed by the stable matching algorithm [53, 55]. Consequently, the matching cost computation step is crucial for establishing the matching correctly. In this research, we concentrate on defining an approach which produces more reliable matching costs than standard window-based approaches.

In our work, the correspondences between points are found based on signatures defined on a small neighbourhood (a window of given size) in the input images. Signatures of tentative matches are compared by a normalized cross-correlation statistic. In order to establish the correspondences correctly, it is important to improve the discriminability of the signatures. We expect the discriminability to improve when the image windows correspond to the same surface patch by adapting to its tilt (and possibly curvature), cf. Fig. 4.

There have been several attempts to cope with this problem. In [6], the notion of disparity components has been introduced. The disparity components are defined as connected sets of pixels of the same disparity. The algorithm is as follows: for each pixel, a few plausible disparities are computed. Based on these disparities, the pixel is assigned to the respective disparity components (those with disparities plausible for the pixel). The final disparity for each pixel is found simply by selecting the disparity of the largest disparity component assigned to this pixel. Kanade and Okutomi [32] have already used the term ‘adaptive window’ for rectangular windows which adapt only by changing their size. In their approach, the window size is increased iteratively in all four directions until the disparity estimate uncertainty at the window's centre converges. Chen and Medioni [11] have proposed a method where the pixel disparities are computed by tracing high-correlation structures: first, local maxima of correlation are selected as seeds; then the disparity of all pixels is computed by tracing out the patches of high correlation. Zhang [70] has defined the windows (support regions in their terminology) to cover areas with the same disparity, and the matching is performed based on this knowledge. The disadvantage of this method is that the windows are not found from the data during the matching process, but are instead established by enforcing the image segmentation.

Figure 4: Demonstration of the insufficiency of rectangular windows in the input images for computing the signatures, due to the different viewpoints of the cameras. The scene is captured by two cameras, C1 and C2, resulting in the left and right input images. Although the point of interest is in the middle of both the left and right windows, and the left halves of the windows are identical (marked by the blue solid line), the right halves of the windows capture completely different parts of the scene (marked by the red dashed line in the right window and by the green dash-dotted line in the left window). That is why the windows are incomparable.

From our point of view, to avoid the situation shown in Fig. 4, the windows have to cover just the corresponding parts, in other words to adapt to structures of “the real scene”, which is impossible in the input images. Therefore, we have established an approach called Disparity Component Matching [40, 39], where we assume the “real scene” has an image in the disparity space (which is the set of all tentative matches). We propose applying adaptive windows which adjust to connected structures of points with high similarity values in the disparity space (we call them the disparity components). To capture small object surface variations, the disparity within one component is allowed to differ by one between two neighbouring pixels (which results in components of non-constant disparity). This enables us to match correctly small, curved, or very slanted objects. The shape of the adaptive windows is found by projecting the disparity components into the input images, which guarantees that the windows cover the same structure. The two adaptive windows (which are typically non-rectangular) are then used to re-compute the similarity values. Finally, the matching problem is solved on the re-weighted signatures. This is where we differ from [6, 32, 11].

Disparity space is the set of all possible matches between corresponding rows of two or more images. The disparity space can be visualized as matching tables computed separately for each image row (note that the input images are rectified) and stacked on top of each other. Each matching table consists of similarity values evaluated on the Cartesian product of left and right image pixels in the corresponding row. Matching table elements are called (tentative) pairs. A part of the disparity space (matching tables for rows r − 1, r and r + 1) is shown in Fig. 5, top row. Rows of a matching table represent columns in the left image (labeled by i), and columns of the matching table represent columns in the right image (labeled by j).
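A sketch of one such matching table (assuming rectified grayscale NumPy images and any window-comparison function, e.g. the MNCC of Eq. (7) introduced in Sec. 4.1; the helper name and the border handling are illustrative):

import numpy as np

def matching_table(left, right, row, similarity, half=2):
    # Similarity values for all pairs (i, j) of columns of one image row.
    # table[i, j] compares the (2*half+1) x (2*half+1) window centred at
    # (row, i) in the left image with the one centred at (row, j) in the
    # right image; 'row' is assumed to lie at least 'half' pixels from the
    # top and bottom image borders.
    h, w = left.shape
    table = np.full((w, w), -np.inf)
    for i in range(half, w - half):
        wl = left[row - half:row + half + 1, i - half:i + half + 1]
        for j in range(half, w - half):
            wr = right[row - half:row + half + 1, j - half:j + half + 1]
            table[i, j] = similarity(wl, wr)
    return table

# The disparity space is the stack of such tables over all image rows;
# a table element (r, i, j) is a tentative pair with disparity d = i - j.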


Figure 5: Neighbourhoods of a point on row r at position (i, j) in the matching table (empty circle). The left column (4-neighbourhood) represents the neighbourhood for disparity components with constant disparity, the right column (20-neighbourhood) for disparity components with varying disparity. At the top, a part of the disparity space, for rows r − 1, r and r + 1, is shown; at the bottom, the matching tables for these rows, with the respective neighbourhoods marked, are shown separately.

The disparity components are defined in the disparity space as connected structures of pairs with high similarity values. The connectedness within a disparity component is defined by a neighbourhood relation. Two high-similarity pairs (r, i, j) and (r′, i′, j′), where r, r′ mark rows, i, i′ columns in the left image and j, j′ columns in the right image, are neighbours in the disparity space if and only if (1) they are neighbours of each other in the left or right image, and (2) the difference of their disparities is smaller than or equal to a predefined value δ. In mathematical terms, the neighbourhood relation can be formulated as follows:

Definition 1: The pairs (r, i, j) and (r′, i′, j′) in the disparity space are neighbours of each other if and only if the following three conditions hold:

1. |r − r′| ≤ 1,

2. |i − i′| ≤ 1 ∨ |j − j′| ≤ 1 for r = r′, and i = i′ ∨ j = j′ for r ≠ r′,

3. |d(r, i, j) − d(r′, i′, j′)| ≤ δ,

where r, r′ mark rows, i, i′ mark columns in the left image, j, j′ mark columns in the right image, and d(r, i, j) = i − j represents the disparity of the point.
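A direct transcription of Definition 1 as a predicate (a minimal sketch; pairs are (r, i, j) tuples and delta plays the role of δ):

def are_neighbours(p, q, delta=1):
    # Neighbourhood relation of Definition 1 between two pairs (r, i, j).
    r, i, j = p
    r2, i2, j2 = q
    if abs(r - r2) > 1:                        # condition 1: rows at most one apart
        return False
    if r == r2:                                # condition 2, same row
        cond2 = abs(i - i2) <= 1 or abs(j - j2) <= 1
    else:                                      # condition 2, neighbouring rows
        cond2 = i == i2 or j == j2
    d_p, d_q = i - j, i2 - j2                  # disparities d(r, i, j) = i - j
    return cond2 and abs(d_p - d_q) <= delta   # condition 3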


Using the neighbourhood relation recursively, the disparity components are traced out. For each match (r, i, j) in the disparity space, the corresponding disparity component can be identified uniquely (one match can be part of at most one component). Based on the disparity component, the shape of the adaptive window and its projections into the left and right input images are found. This is the way our windows adapt to corresponding parts of the scene.

The parameter δ in the definition of the neighbourhood relation allows variations of disparity within one disparity component. If we restrict the definition to points with the same disparity (δ = 0), we obtain constant disparity components with a 4-neighbourhood relation, shown in Fig. 5, left column. The definition of 4-connected components corresponds to Boykov's approach [6], which is limited to scenes with objects of constant disparity (a strong restriction).

In our approach, the difference between neighbouring pixel disparities is allowed to be at most one (δ = 1). Consequently, we obtain varying-disparity components with a 20-neighbourhood relation, shown in Fig. 5, right column. This definition allows small variations in disparity to be captured.

3.1.1 Disparity Component Algorithm

In this paragraph we describe the Disparity Component Matching algorithm. The inputs to the algorithm are the left and right rectified images and the output is the disparity map of the scene. The algorithm consists of four steps:

1. preliminary selection of match candidates (called aggregation in [61]) – to find matches with high similarity values,

2. tracing the disparity components – to identify the disparity component of each match candidate,

3. re-weighting the similarity values – to obtain more reliable similarities,

4. computing the final disparity map – to find the resulting correspondences.

In the first step, a preselection algorithm selects match candidates. The disparity space is segmented into a set of connected structures (formed by the match candidates) by eliminating weak tentative pairs. The requirements on the results of this step are the following: (1) multi-valued results (to obtain matching hypotheses), (2) dense results (ideally without holes), and (3) no structures of the image may be omitted.

The preselection step can be based on global energy minimization [30, 52, 33] or on local correlation methods [32, 54]. In our approach, we have applied a local correlation method; the correlations are evaluated on small rectangular fixed-size (5 × 5 pixels) image windows.


In the second step, the disparity components are traced out on the match candidates resulting from the first step. The tracing is based on recursively applying the 20-neighbourhood relation defined above, shown in Fig. 5, right column.

For each preselected match candidate, its unique disparity component is identified (one match can belong to only one disparity component). Based on the disparity component, the matching window is found in the disparity space. The projection of this window into the left and right images unambiguously defines two non-rectangular (adaptive) windows for this match in the input images.
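A sketch of the tracing itself, assuming the match candidates from the first step are given as a set of (r, i, j) tuples and reusing the are_neighbours predicate from the sketch after Definition 1; neighbours are generated by enumerating small index offsets and filtering them through the relation:

from collections import deque

def trace_component(seed, candidates, delta=1):
    # Grow the disparity component containing 'seed' over the candidate set.
    # 'candidates' is a set of (r, i, j) tuples with high similarity values.
    component, queue = {seed}, deque([seed])
    while queue:
        p = queue.popleft()
        r, i, j = p
        # offsets up to 2 in i and j are enough to cover the 20-neighbourhood
        for dr in (-1, 0, 1):
            for di in range(-2, 3):
                for dj in range(-2, 3):
                    q = (r + dr, i + di, j + dj)
                    if q == p or q in component or q not in candidates:
                        continue
                    if are_neighbours(p, q, delta):
                        component.add(q)
                        queue.append(q)
    return component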

In the third step, a new similarity value is computed for each match candidate using the two non-rectangular windows resulting from the second step. The match similarity is re-computed only if the corresponding disparity component is large enough (the minimal size is defined in advance); otherwise the match is removed (to suppress mismatches caused by noise or weakly textured areas).

In order to obtain similarity values comparable in their statistical properties, only a small subset of the disparity component, of a pre-defined fixed size, is taken into account each time. A high similarity computed over a large window and a high similarity computed over a small window are incomparable because they are likely to have very different confidence intervals; this is the reason why the matching window size must be equal for all pixels. Although enlarging the window increases the similarity discriminability, our experience shows that there is no need for large matching windows in area-based approaches [53].

In the fourth step, the final disparity map is computed. The only requirement on this step is a low error rate. The density of the disparity map is not prioritized; however, the results should be as dense as possible. As we will see in Sec. 4, although the disparity component matching does not produce fully dense disparity maps, it does considerably increase the density compared to a plain algorithm.

Any stereo matching algorithm can be used to compute the final matching. The method is applied to the set of re-weighted match candidates (no further re-evaluation of matching costs is performed) and results in a single-valued disparity map.

3.2 Stereo Algorithms Evaluation

Our goal is to design an evaluation methodology which enables not only the ranking of stereo algorithms but also the study of matching error mechanisms, with the aim of supporting stereo algorithm development. It is therefore necessary to design a set of experimental data where each individual experiment is targeted at a specific cause of error and other causes of error are excluded. We focus on failure due to insufficient signal-to-noise ratio. Error mechanisms can then be studied in detail to discover specific weaknesses of various algorithms.


Our proposed approach [54] belongs to the evaluation methods based on ground truth. The aim of our work is not to evaluate all available matching algorithms but to show that this is possible with the proposed method. Our intention is to make the data and the evaluation code public at our web site. If the response is positive, a larger evaluation study will be possible.

3.2.1 Types of Error

Eight specific error types were considered. They are all mutually related and all of them are important for assessing the quality of a matching algorithm. We assume that half-occluded regions are identified.

All errors are computed from three basic matching error types:

1. False positives, i.e. matches found in half-occluded region,

2. False negatives, i.e. missing matches in binocularly visible area (holes),

3. Mismatches, i.e. matches in binocularly visible area where the difference from ground-truth was greater than 1.

In our experiment we distinguished the foreground object and the background object, since we are interested in seeing whether any of the studied matchings exhibits a bias towards large objects. The errors computed for the evaluation were divided into three groups:

1. Overall quality:

• Failure Rate (FR) is the number of all mismatches and all false negatives in the binocularly visible region normalized by the area of this region. This error is related to the overall matching quality but does not measure half-occluded region artifacts. It is a sum of the two components defined below:

FR = (MR + FNR) / 2.

• False Positive Rate (FPR) is the number of matches in the half-occluded region normalized by the area of this region. This error measures the inability to correctly detect half-occluded regions.

• Occlusion Boundary Accuracy (OBA) is the number of false negatives and false positives in the vicinity of ±2 pixels around the occlusion boundary normalized by the entire area of this region. This error measures the occlusion boundary detection quality; a zero value indicates correct detection. It is a sum of the two components defined below:

OBA = (OBFPR + OBFNR) / 2.

2. Accuracy and density:

26

Page 30: CENTER FOR MACHINE PERCEPTION Stereoscopic Matching ...cmp.felk.cvut.cz/ftp/articles/kostkova/Kostkova-TR-2002-13.pdf · The field of stereo matching focuses exactly to the task

• Mismatch Rate (MR) is the number of all mismatches normalized by the binocularly visible area. This error measures the accuracy of the matching.

• False Negative Rate (FNR) is the number of missing matches (holes) in both objects normalized by the entire binocularly visible area. This error measures the sparsity of the disparity map.

• Occlusion Boundary False Positive Rate (OBFPR) is the number of false positive matches in the half-occluded region within two pixels from the occlusion boundary, normalized by the area of this region. This error measures the shift of a misdetected occlusion boundary into the half-occluded region (thus narrowing the half-occluded region).

• Occlusion Boundary False Negative Rate (OBFNR) is the number of missing matches (holes) in the binocularly visible region within two pixels from the occlusion boundary, normalized by the area of this region. This error measures the shift of a misdetected occlusion boundary into the object (thus widening the half-occluded region).

3. Unbiasedness (B) is the ratio of the failure rates computed for the foreground and the background object independently:

B = FR_background / FR_foreground.   (6)

A matching algorithm is biased if it assigns correct matches in large objects more often than in small objects. Small values of B imply a biased matching; a value of one implies an unbiased matching.

Since various matching algorithms may differ in matching window size, differences in the vicinity of the occlusion boundary must be expected. When large windows are used, the occlusion boundary tends to shift from its true position by as much as the matching window radius [56]. From our experience it follows that a 5 × 5 matching window is sufficient for the test dataset; we therefore assume any tested algorithm uses a matching window of 5 × 5 pixels or smaller. The region of ±2 pixels around the boundary is excluded from the evaluation of all error types except the occlusion boundary accuracy, for which only the right occlusion boundary is considered (the one neighbouring the half-occluded regions shown in Fig. 7).
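A hedged sketch of how the basic rates could be computed from a disparity map, the ground truth and a half-occlusion mask (NumPy arrays; NaN marks unmatched pixels; the ±2-pixel boundary band and the occlusion-boundary rates are omitted for brevity, and all names are illustrative):

import numpy as np

def basic_error_rates(disp, gt_disp, half_occluded):
    # disp          -- computed disparity map, np.nan where no match was found
    # gt_disp       -- ground-truth disparity in binocularly visible pixels
    # half_occluded -- boolean mask of half-occluded pixels
    visible = ~half_occluded
    matched = ~np.isnan(disp)

    # absolute disparity error, evaluated only where a match exists
    err = np.zeros_like(gt_disp, dtype=float)
    err[matched] = np.abs(disp[matched] - gt_disp[matched])

    mismatch = visible & matched & (err > 1)   # mismatches in the visible area
    false_neg = visible & ~matched             # holes in the visible area
    false_pos = half_occluded & matched        # matches in the half-occluded area

    mr = mismatch.sum() / visible.sum()
    fnr = false_neg.sum() / visible.sum()
    fpr = false_pos.sum() / half_occluded.sum()
    fr = 0.5 * (mr + fnr)
    return {"FR": fr, "FPR": fpr, "MR": mr, "FNR": fnr}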

3.2.2 Experimental Setup

The test scene we used (see Fig. 6) consists of three long thin textured stripes (we call it the object; it is held in place by a silver U-frame visible at right) in front of a textured plane (called the background; it is at far right, behind the frame). Of the four cameras (mounted on a black frame at left), we used only one vertical pair in this experiment. We used digital Pulnix TM-9701 cameras, a DataTranslation DT3157 frame-grabber and a custom digital multiplexer.


Figure 6: The test scene: experimental setup

The object stripes are approximately 14 pixels wide in the images. The object-background distance was adjusted such that the width of the half-occluded region is exactly equal to the width of the stripes (widening the half-occluded region would result in violating the binocular ordering).

The scene was illuminated by a controlled stabilized illuminant whose adjustable intensity was used to vary the texture contrast. The contrast was measured as the mean value of the left image. The smallest-contrast texture is visually hardly distinguishable from image noise (contrast value of 3.0). In Fig. 7, left images from the test set are shown for contrast values of 3.0, 10.5, and 74.0, respectively (from left to right).

Ground-truth occlusion boundaries were detected manually with the help of illumination that enhanced them in the images. The ground-truth disparity within the boundaries was obtained by fitting planes to the disparity map computed by the Maximum-Cost Stable Monotonic Matching algorithm [53] on the highest-contrast images. The resulting map is shown in Fig. 7, right. Color coding is used since disparity errors then stand out much better under visual inspection.

The cameras used in the experiment were verging towards a spatial point slightly behind the scene and their common optical axis was thus not perpendicular to the target. This results in slowly varying disparity across both the object and the background. The background disparity varies from −5 to 20 pixels and the object disparity varies from −16 to 9 pixels. Triangular portions of the map along the left and right image edges are excluded from the ground truth since various algorithms respond differently to non-zero disparity at the image border.

The presented evaluation methodology focuses on matching algorithm performance under poor image SNR and does not cover failures due to structural ambiguity (the presence of repetitive patterns) or non-Lambertian surface reflectance. Our test data are designed in such a way that these error mechanisms (which are of a fundamentally different nature) do not bias the results: object texturing is random, of uniform contrast and without contrast changes across discontinuities, and specularities are excluded.


Figure 7: Various contrast values of the texture, 3.0 (lowest), 10.5 (middle) and 74.0 (highest), and the resulting ground-truth disparity map. The bar at right shows the disparity map color coding: low disparities are blue, high disparities are red.

3.2.3 Evaluation Method

The purpose of this section is to describe the possibilities for evaluating and studying algorithms offered by our methodology. An exemplary demonstration with a detailed interpretation of the evaluated results is given in Sec. 4.

Our approach gives two opportunities for analyzing algorithm behaviour. The first one is based on graph plots depicting each of the defined errors independently, shown in Fig. 8. On these graphs it is possible to study how the results of the algorithms change with increasing signal-to-noise ratio. Texture contrast, depicted on the horizontal axis in all plots, is directly related to the signal-to-noise ratio, while the vertical axes show the respective error rates. Both axes use a logarithmic scale.

The second opportunity for studying the algorithms is based on the resulting disparity maps, shown in Fig. 9. From the results, error-disparity maps are computed: incorrect matches are depicted by the colour of the corresponding error type, while correct matches are assigned their disparity.

4 Experiments

In this section, we present experimental results of both proposed methods (the Disparity component matching and the stereo algorithms evaluation method).

The Disparity component matching is evaluated in two different experiments, which give a qualitative and quantitative comparison with two other stereo approaches: the Confidently-stable matching [55] and the Maximum-likelihood algorithm [15], described below.

The two experiments are also designed to compare our evaluation methodology with the (only) currently used evaluation method. In the first experiment, we perform the comparison based on our evaluation methodology; we do not only examine the matching algorithms but also show how to interpret the evaluation results. In the second experiment, we compare the methods on Scharstein & Szeliski's test dataset [61], which is widely used for ranking existing stereo methods.

4.1 Evaluated Algorithms

We have selected the following three algorithms to be examined: the Disparity component matching [39], the Confidently-stable matching [55], and the ML matching based on the Cox et al. algorithm [15].

The Disparity component matching (DCM) is our proposed method, to which the other algorithms are compared. Since the fourth step of the DCM was performed by the Confidently-stable matching (CSM), the CSM was selected as the second algorithm. As the third algorithm, we chose Cox's algorithm, a representative of the opposite class of stereo algorithms, the global methods.

To get comparable results, all tested algorithms used a 5 × 5 matching window. The parameters of the tested methods were set to produce globally as good results as possible, and they are the same in all experiments.

Disparity Component Matching. This algorithm is our proposed method described in Sec. 3.1. For the following experiments, the match candidates (the first step) were selected by the stable matching algorithm [54] assuming the ordering constraint. Using a finite inhibition zone depth allows multi-valued results to be produced; the inhibition zone depth was set to 2 pixels, which is the minimum allowing disparity component segmentation. The computation of the final disparity map (the fourth step) was performed by the Confidently-stable matching [55], described in the next paragraph. The Confidently-stable parameters α and β were set to 20 and 0.05, respectively.

Confidently-Stable Matching. This algorithm, proposed in [55], establishes the matching based on the stability constraint, which guarantees the best unique solution at the given confidence level. The ordering and uniqueness constraints are incorporated. The confidence parameters α and β were set to 10 and 0.02, respectively.

Both the DCM (in the first step) and the CSM used the modified normalized cross-correlation [45] as the image similarity:

MNCC(WL, WR) = 2 cov(WL, WR) / (var WL + var WR) ∈ [−1, 1],   (7)

computed between 5 × 5 image windows WL, WR.
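A minimal NumPy sketch of Eq. (7) for two equally sized image windows (the small epsilon guarding against constant windows is our addition):

import numpy as np

def mncc(wl, wr, eps=1e-9):
    # Modified normalized cross-correlation of Eq. (7), in [-1, 1].
    wl = wl - wl.mean()
    wr = wr - wr.mean()
    cov = (wl * wr).mean()
    var_l, var_r = (wl * wl).mean(), (wr * wr).mean()
    return 2.0 * cov / (var_l + var_r + eps)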

ML Matching. The Maximum-likelihood (ML) matching is a re-implementation of the Cox et al. algorithm [15]. The correspondences are computed for each epipolar line separately via dynamic programming based on the sum of squared differences (SSD). The ordering constraint is employed. The regularization penalty was set to a fixed value of 500.
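This is not the actual re-implementation of [15], but a generic sketch of the underlying idea: per-scanline dynamic programming under the ordering constraint with SSD matching costs and a fixed occlusion (regularization) penalty. The window size and the penalty of 500 follow the text; everything else is illustrative.

import numpy as np

def scanline_dp(left_windows, right_windows, occlusion_penalty=500.0):
    # Ordering-constrained DP matching of one epipolar line.
    # Inputs are lists of equally sized float windows (one per column) taken
    # from the same row of the left and right images; SSD is the match cost.
    # Returns a disparity (i - j) or None for every left column i.
    n, m = len(left_windows), len(right_windows)
    cost = np.full((n + 1, m + 1), np.inf)
    move = np.zeros((n + 1, m + 1), dtype=np.uint8)  # 1=match, 2=skip left, 3=skip right
    cost[0, :] = occlusion_penalty * np.arange(m + 1)
    cost[:, 0] = occlusion_penalty * np.arange(n + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ssd = float(np.sum((left_windows[i - 1] - right_windows[j - 1]) ** 2))
            options = (cost[i - 1, j - 1] + ssd,            # match columns i-1 and j-1
                       cost[i - 1, j] + occlusion_penalty,  # left column unmatched
                       cost[i, j - 1] + occlusion_penalty)  # right column unmatched
            best = int(np.argmin(options))
            cost[i, j] = options[best]
            move[i, j] = best + 1
    # backtrack the optimal monotonic (ordering-preserving) path
    disparity = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        if move[i, j] == 1:
            disparity[i - 1] = (i - 1) - (j - 1)
            i, j = i - 1, j - 1
        elif move[i, j] == 2:
            i -= 1
        else:
            j -= 1
    return disparity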


4.2 Evaluation Based on our Methodology

This methodology has been proposed in our research (for a detailed description see Sec. 3.2) in order to allow studying stereo algorithms in detail. We consider this ability indispensable for further algorithm improvement.

In this section, we demonstrate all the capabilities of our methodology together with the evaluation of the examined algorithms. In order to obtain comparable results, in all cases the disparity search was done over the full range of ±333 pixels in 587 × 333 images.

Results for all the tested algorithms (the Disparity component matching, DCM; the Confidently-stable matching, CSM; the Maximum-likelihood matching, MLM) are shown in the plots in Fig. 8. Texture contrast (on the horizontal axis in all plots) is directly related to the signal-to-noise ratio. The respective error rates are shown on the vertical axes. Note that both axes have a logarithmic scale.

Failure Rate (Fig. 8(a)). Although the overall quality of the CSM appears comparable to that of the MLM, visual inspection of Figs. 9(g) and 9(k) completely disproves this impression. The high error rate of the CSM is due to the false negative component of the error (see later). The re-ascending character of the failure rate curve of the MLM is due to increasing sparsity in high-contrast images, cf. Fig. 9(k). We can see that the DCM approach has the lowest overall failure rate, which is confirmed by Figs. 9(a) and 9(c).

False Positive Rate (Fig. 8(b)). The DCM as well as the CSM are an order of magnitude better than the MLM for the highest contrast; for the lowest contrast they are incomparably better (cf. Fig. 9). The complete failure of the ML algorithm on low-contrast images is the reason why its false positive rate is so high. The way it fails is clearly seen in Fig. 9(i). This is a manifestation of an unbalanced continuity prior: it becomes too strong under low image SNR and the solution is thus unable to make the jump necessary to detect the thin object.

False Negative Rate (Fig. 8(c)). The sparsest maps are obtained from the CSM and the densest maps from the ML algorithm, except for large contrasts, where the DCM performs better by an order of magnitude. It is interesting to note that while the DCM, like the CSM, improves the density of the results with increasing SNR, the MLM exhibits the inverse behaviour. The density of the MLM results for low contrasts is achieved at the cost of missing the foreground object completely, as shown in Fig. 9(i). Note the sparsity of the CSM along lines where the disparity changes by one (Fig. 9(g)); this is probably due to the sensitivity of the correlation measure to image discretization.

Mismatch Rate (Fig. 8(d)). The DCM and the CSM produce comparable results, which are two orders of magnitude better than those of the ML matching. The results for the DCM confirm the observation that this approach significantly improves the false negative rate (Fig. 8(c)) while the accuracy remains the same. The poor results of the ML algorithm are due to its tendency to either miss the foreground object or smooth over the discontinuity between the thin foreground object and the background.

Figure 8: Different types of matching errors evaluated on the ground-truth test data for all three algorithms; the Disparity component matching is labeled 'DCM', the Confidently-stable matching 'CSM', and the Maximum-likelihood matching 'ML'. The panels plot, against texture contrast (both axes logarithmic): (a) Failure Rate FR (binocular artifacts), (b) False Positive Rate FPR (monocular artifacts), (c) False Negative Rate FNR (sparsity), (d) Mismatch Rate MR (inaccuracy), (e) Occlusion Boundary Accuracy OBA (occlusion artifacts), (f) OBFPR (occlusion FP), (g) OBFNR (occlusion FN), and (h) Unbiasedness B.

Occlusion boundary accuracy (Figs. 8(e), 8(f), 8(g)). The overall accuracy of occlusion boundary detection is comparable in all algorithms, but they differ considerably in the false positive and false negative components of the error. The false positive rate at the occlusion boundary (Fig. 8(f)) follows the general pattern of the overall false positive rate (Fig. 8(b)), with a small but notable increase in error in the CSM and a slightly larger one in the DCM. The false negative rate at the occlusion boundary (Fig. 8(g)) is the highest in the CSM and second highest in the DCM, consistently over all texture contrasts. This means that the two algorithms tend to suppress occlusion boundary artifacts due to the non-zero width of the matching window [56]; on the other hand, they give a sparser disparity map close to the boundary. This is notable mainly in Fig. 9(f) along the right side of the foreground object, where the thin white stripe of false positives is not marked as errors because this region is excluded from the error evaluation as discussed above. However, even in this artifact the DCM improves the results, cf. Fig. 9(b). In the ML algorithm, the false negative rate grows steadily, much faster than the false positive rate decreases. This means the MLM shifts the occlusion boundary and shrinks the occluded region depending on the image contrast. The other two algorithms do not exhibit this behaviour.

Unbiasedness (Fig. 8(h)). Methods based on strong prior continuity models will be heavily biased; methods without such models will be unbiased. In agreement with this, we can see that the CSM and the DCM show no bias. On the contrary, the bias is very large in the ML algorithm, except for the highest-contrast images; under high contrast the ability of the MLM to detect small objects is undermined by the increased sparsity of the map, cf. Fig. 9(k). In very low-contrast images, all algorithms show apparent unbiasedness, but the reason differs between the CSM and the ML algorithm: in the CSM the false negative rate is high (Fig. 8(c)) in both the object and the background, whereas in the MLM the mismatch rate is high (Fig. 8(d)). The consequence is that the unbiasedness ratio (6) becomes ill-conditioned, which is why it is close to unity for small contrasts.

Disparity maps computed by the tested algorithms are shown in Fig. 9 in order to allow studying the various specific errors of the methods. The two left columns give results for the intermediate image contrast of 10.5, the two right columns for the maximum contrast of 74.0, cf. Fig. 7. In each group, the left column shows the disparity maps as obtained from the respective algorithms, using the same disparity color coding as the ground-truth disparity map in Fig. 7; gray represents unmatched points. The right column shows the error-disparity maps: the disparity is gray-coded, while the various matching errors are color-coded as follows: green: false negatives, red: false positives, blue: mismatches, cyan: occlusion boundary false negatives, yellow: occlusion boundary false positives.
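A sketch of how such an error-disparity map could be rendered, assuming boolean error masks (such as those computed in the earlier sketch) and a gray-coded disparity image in [0, 1]; the RGB triples follow the colour coding listed above:

import numpy as np

def error_disparity_map(disp_gray, false_neg, false_pos, mismatch, ob_fn, ob_fp):
    # Colour-code matching errors on top of a gray-coded disparity image.
    # disp_gray is an (H, W) array in [0, 1]; the remaining arguments are
    # boolean masks of the respective error types.
    rgb = np.dstack([disp_gray] * 3).astype(float)
    rgb[false_neg] = (0.0, 1.0, 0.0)   # green: false negatives
    rgb[false_pos] = (1.0, 0.0, 0.0)   # red: false positives
    rgb[mismatch] = (0.0, 0.0, 1.0)    # blue: mismatches
    rgb[ob_fn] = (0.0, 1.0, 1.0)       # cyan: occlusion boundary FN
    rgb[ob_fp] = (1.0, 1.0, 0.0)       # yellow: occlusion boundary FP
    return rgb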

The thin white line on the left side of the object (clearly visible e.g. in Fig. 9(g)) marks false negatives in the ±2 pixel neighbourhood of the occlusion boundary, which is excluded from the evaluation.

4.3 Evaluation Based on Scharstein & Szeliski’s Test Dataset

As described in Sec. 2.2, the Scharstein & Szeliski methodology, proposed in [61], focuses on a taxonomy and evaluation of dense two-frame stereo algorithms. In order to obtain the algorithm ranking, a statistic based on the percentage of bad matching pixels is computed on four different image pairs with known ground truth. The experiments monitor the percentage of bad pixels in non-occluded regions (BO), texture-less regions (BT) and depth discontinuity regions (BD).

Figure 9: Disparity maps produced by the tested algorithms (rows: Disparity Component Matching, panels (a)-(d); Confidently-Stable Matching, panels (e)-(h); Maximum Likelihood Matching, panels (i)-(l)). Per column: disparity maps together with their errors under texture contrast 10.5 (two left columns) and under texture contrast 74.0 (two right columns). Colour coding of the disparity maps has been described in Fig. 7. Color coding of errors: green: false negatives (FN), red: false positives (FP), blue: mismatches, cyan: occlusion boundary FN, yellow: occlusion boundary FP.

The evaluation of the tested methods is shown in Tab. 1. For each image, the statistics for all three regions are computed. In [61], the requirement for algorithm quality is set as follows: if the percentage of bad matching pixels in non-occluded regions is smaller than 10%, the algorithm produces good results. As can be seen in Tab. 1, this condition is fulfilled by all the approaches; however, as we know from the previous comparison, the ML approach is not able to produce acceptable results in general, cf. Fig. 9(i). Moreover, although the CSM obtains a better evaluation than the DCM, the quality of the DCM results is in fact better. We think the main reasons for this inconsistency are that (1) only the mismatch errors are monitored, and (2) the error rate is evaluated without respect to the density of the results. These examples demonstrate the insufficiency and inadequacy of this evaluation.

The second possibility for comparison offered by this methodology is based on visual information only. The results are presented in Fig. 10. The algorithms are examined on the given four image pairs: Lab scene, Venus, Sawtooth, and Map. The input left images of these scenes are shown in the top row of Fig. 10. The rest of Fig. 10 shows the disparity maps produced by the tested algorithms: the Disparity Component Matching, the Confidently-Stable Matching, and the Maximum-Likelihood Matching. We can notice the dramatic improvement over the CSM results obtained by applying the DCM approach: the disparity maps are much denser, while the accuracy appears to be preserved. Only a few errors appeared (e.g. in the top left corner of the Lab scene image). The ML results confirmed the “streaking effect” and the inability to detect discontinuities correctly.

The only opportunity for detailed algorithm monitoring is given by computing the error disparity maps separately for each of the defined regions: non-occluded, texture-less, or discontinuity. This is demonstrated on the results produced by the CSM on the Venus image pair in Fig. 11. However, the selected dataset is not very suitable for such a study (mainly due to the geometry of the scenes: a slanted object or the very difficult Lab scene). Consequently, it is impossible to discover the detailed behaviour of the algorithms.

4.4 Comparison of Evaluation Methodologies

Our proposed methodology allows studying the failure mechanisms of various binocular matching algorithms. Towards this purpose we designed a test dataset in which the same scene is captured under varying signal-to-noise ratio (SNR). The algorithms are tested under both low and high SNR, which reveals their ability to cope with real scenes of non-uniform texture contrast. In order to discover the various weaknesses of the algorithms, the test data is designed to target a specific error mechanism while excluding other causes.
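
Purely as an illustration of this design principle (not the actual acquisition procedure, which uses physically projected textures), a sequence of contrast levels could be synthesized by scaling the texture amplitude around its mean while keeping the noise floor fixed, so that the effective SNR is controlled by the contrast alone. All names and constants below are assumptions.

    import numpy as np

    def render_contrast_series(texture, levels, noise_sigma=2.0, rng=None):
        """Produce copies of a texture image with its contrast scaled to given levels.

        Contrast is varied by scaling the deviation from the mean intensity;
        additive Gaussian noise with fixed sigma keeps the noise floor constant.
        """
        rng = np.random.default_rng() if rng is None else rng
        mean = texture.mean()
        series = []
        for scale in levels:
            img = mean + scale * (texture - mean)
            img = img + rng.normal(0.0, noise_sigma, size=texture.shape)
            series.append(np.clip(img, 0, 255))
        return series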

The evaluation is based on two different principles: graph plots, and disparity maps together with error-disparity maps. The graph plots compare the performance of the algorithms under all ten levels of SNR, for each error type independently. The disparity maps allow studying the results in detail, while the error-disparity maps reveal the areas where an algorithm fails.


            Lab Scene              Venus                  Sawtooth               Map
       B_O    B_T    B_D     B_O     B_T     B_D     B_O    B_T    B_D     B_O     B_T    B_D
DCM    2.2%   1.7%   8.6%    2%      1.6%    16%     3%     0.9%   15%     0.53%   3.2%   3.3%
CSM    1.1%   0.1%   6.2%    0.73%   0.09%   11%     1.4%   0.0%   14%     0.15%   0.6%   1.5%
ML     7.7%   11%    21%     9%      12%     28%     4.3%   6.1%   18%     0.75%   0.4%   7.4%

Table 1: Comparative performance based on the percentage-of-bad-matching-pixels statistics. For each image the columns are organized as follows: the left column corresponds to non-occluded areas (B_O), the middle column to low-texture regions (B_T), and the right column to the occlusion boundaries (B_D).

Rows: Disparity Component Matching, Confidently-Stable Matching, Maximum-Likelihood Matching. Columns: Lab scene, Venus, Sawtooth, Map.

Figure 10: Qualitative comparison of the proposed methods. The columns correspond to the following images: Lab scene, Venus, Sawtooth, and Map. Colour coding is described in Fig. 7.


Venus: B_O = 0.73, B_T = 0.09, B_D = 11.07

Figure 11: The error disparity maps for the Venus pair: on the left the resulting disparity map, followed by the error disparity maps computed for each region separately: non-occluded, texture-less, and discontinuities.

On the contrary, the Scharstein & Szeliski methodology does not offer any tool for studying the algorithms. The only type of error that is monitored are mismatches, which cannot describe the real behaviour of the algorithms. Besides, although error-disparity maps are evaluated for each region separately, they cannot be used for algorithm improvement due to the unsuitable selection of the dataset. Consequently, this methodology can serve only for ranking the algorithms.

5 Thesis Plan

In our further research, we will focus on stereo matching, namely on a formulation of the matching problem that allows obtaining more reliable matching costs.

In our present research, we have come to the conclusion that dividing the matching process into two semi-independent parts, where the first is aimed at computing reliable and discriminable matching costs and the second at establishing the matching, is inevitable if we want to find the correspondences correctly. This conclusion is derived from the “error duality” of these two steps. Since the requirement on the second step is to produce clean results (with no false positives and mismatches), the matching algorithm has to have the chance to select among good matching costs; otherwise the results are very sparse, cf. Fig. 9(e). Consequently, the first step should “prepare” the matching costs accordingly. The method should find a highly reliable (multi-valued) matching, i.e. perform a kind of preselection, in which false negatives must be eliminated, while false positives or mismatches do not cause problems because they will be excluded by the later method. Whilst the second step can, as we think, be solved by the Confidently-Stable Matching proposed by Radim Šára [55], the first step is still an open task. We will concentrate on it in the thesis.
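
Schematically, the intended division of labour could look like the following sketch. The function names and interfaces are assumptions, introduced only to make the two-step structure explicit: the second step stands for the Confidently-Stable Matching of [55], the first for the preselection discussed above.

    def two_stage_matching(left, right, preselect, stable_match):
        """Two semi-independent steps of the intended matching pipeline.

        preselect    -- returns a multi-valued set of tentative pairs together with
                        their matching costs; it may contain false positives and
                        mismatches, but should contain no false negatives
        stable_match -- resolves the tentative pairs into a unique matching,
                        rejecting ambiguous candidates (e.g. Confidently-Stable
                        Matching [55])
        """
        # Step 1: prepare reliable, discriminable matching costs (the open problem).
        tentative_pairs = preselect(left, right)
        # Step 2: establish the final matching; errors left by step 1 are rejected here.
        return stable_match(tentative_pairs)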

Our designed method, the Disparity Component Matching (DCM), supports these observations. The experiments confirmed that research in this direction promises high-quality results. Since the DCM algorithm can be viewed as a kind of heuristic approach, we want to propose a pure formulation in which we (1) propose a clear (and continuous) signature representation, and (2) define in what sense the preselection is optimal.

The second method proposed in this report, the evaluation methodology for stereo matching algorithms, has been designed to allow a detailed study of algorithms.


Discovering the various weaknesses and properties of algorithms is indispensable for their development and improvement. The proposed methodology is almost ready [38]. We are now collecting a new dataset and preparing the method for publication. Consequently, we treat the research on this task as the secondary goal for the thesis. Nevertheless, it is highly required for the correct formulation of the matching problem.

We can categorize our work plan towards the thesis into two goals:

1. Main goal: Stereo matching problem formulation

This part will be the core of our further research. The formulation will come out of our conclusions supported by the DCM approach. However, it has to be done in a non-heuristic way: a suitable optimality criterion must be defined and the correspondence problem solved.

We can derive a few sub-problems which have to be answered in order to formulate and solve our matching task:

• Design a proper signature representation. In order to obtain discriminable signatures, which contribute to more reliable and accurate results, it is essential to represent them correctly. We consider image sampling to be the main problem and want to design a method based on continuous signals. A promising way has been indicated in [12].

• Propose a better preselection method. The preselection method is required to produce results which ensure that the Confidently-Stable Matching will have the option to establish dense correspondences. Thus, its result should be a highly reliable subset of the set of all tentative pairs, with no false negative errors. Possible directions have been outlined in [19, 18].

• Incorporate non-centralized disparity neighbourhoods. In our approach, centralized windows (i.e. the point of interest lies in the centre of the non-rectangular window) are used for evaluating the signatures. However, at object boundaries this causes suppression of even the correct matches. Therefore, improving the results at the object boundary is desired. We propose to apply non-centralized windows (obtained automatically in the component-tracing process), which is in line with [49].

2. Secondary goal: Stereo algorithms evaluation

We treat a methodology for detailed algorithm study as an essential tool for this research. Our proposed methodology has been designed for this purpose. Although it is almost finished, we are now completing the following tasks.

• Create a new test dataset. Due to the inaccurate detection of the ground truth (cf. Fig. 9(a)) and the availability of higher-resolution cameras, we decided to create a new dataset. Towards this purpose we are now working with the PhD student Jan Čech, and the work is nearing completion.


Our new dataset will have the same design; however, it will contain more thin stripes. We have also improved the texture projection procedure. Nevertheless, the main contribution is a more precise ground truth.

• Propose a method for automatic parameter setting. In order to obtain comparable results, the key problem lies in the proper selection of the algorithm parameters. We plan to create a method for automatic parameter setting under the condition of producing the best results (in terms of failure rate and false positive rate) on additional stereo images, which are excluded from the evaluation dataset; a minimal sketch of such a search follows this list.
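
A minimal sketch of such an automatic setting procedure, assuming a small discrete grid of candidate parameter values and a scalar score combining failure rate and false-positive rate (the weighting, the grid, and the evaluation interface are assumptions, not the method we will eventually propose):

    from itertools import product

    def tune_parameters(algorithm, param_grid, tuning_pairs, evaluate, weight_fp=1.0):
        """Pick the parameter combination with the best score on held-out stereo pairs.

        param_grid   -- dict mapping parameter name to a list of candidate values
        tuning_pairs -- stereo pairs with ground truth, excluded from the evaluation set
        evaluate     -- returns (failure_rate, false_positive_rate) of `algorithm`
                        run with the given parameters on one stereo pair
        """
        names = list(param_grid)
        best_params, best_score = None, float('inf')
        for values in product(*(param_grid[n] for n in names)):
            params = dict(zip(names, values))
            score = 0.0
            for pair in tuning_pairs:
                failure, false_pos = evaluate(algorithm, params, pair)
                score += failure + weight_fp * false_pos
            if score < best_score:
                best_params, best_score = params, score
        return best_params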

References

[1] Peter N. Belhumeur. A Bayesian approach to binocular stereopsis. International Journal of Computer Vision (IJCV), 19(3):237–262, 1996.

[2] Dinkar N. Bhat and Shree K. Nayar. Ordinal measures for visual correspondence. Research Report CUCS-009-96, Department of Computer Science, Columbia University, New York, February 1996.

[3] Stan Birchfield and Carlo Tomasi. A pixel dissimilarity measure that is insensitive to image sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(4):401–406, April 1998.

[4] Aaron F. Bobick and Stephen S. Intille. Large occlusion stereo. International Journal of Computer Vision (IJCV), 33(3):181–200, 1999.

[5] Robert C. Bolles, H. Harlyn Baker, and Marsha Jo Hannah. The JISCT stereo evaluation. In Proc. DARPA Image Understanding Workshop, pages 263–274, 1993.

[6] Yuri Boykov, Olga Veksler, and Ramin Zabih. Disparity component matching for visual correspondence. In Proceedings of International Conference on Computer Vision and Pattern Recognition, pages 470–475, 1997.

[7] Yuri Boykov, Olga Veksler, and Ramin Zabih. Markov random fields with efficient approximations. In Proceedings of International Conference on Computer Vision and Pattern Recognition, 1998.

[8] Yuri Boykov, Olga Veksler, and Ramin Zabih. A variable window approach to early vision. Pattern Analysis and Machine Intelligence, 20(12):1283–1294, 1998.

[9] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. In Proceedings of International Conference on Computer Vision, volume 1, pages 377–384, September 1999.

[10] John Canny. A computational approach to edge detection. Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986.


[11] Qian Chen and Gerard Medioni. A volumetric stereo matching method: Application to image-based modeling. In Proceedings of International Conference on Computer Vision and Pattern Recognition, pages 29–34, 1999.

[12] Maureen Clerc. Wavelet-based correlation for stereopsis. In Proceedings 7th European Conference on Computer Vision, volume 2, pages 495–509, Copenhagen, Denmark, May 2002.

[13] Ingemar J. Cox. A maximum likelihood n-camera stereo algorithm. In Proceedings of International Conference on Computer Vision and Pattern Recognition, pages 733–739, Seattle, Washington, 1994.

[14] Ingemar J. Cox, Sunita L. Hingorani, Satish B. Rao, and Bruce M. Maggs. A maximum likelihood stereo algorithm. Computer Vision and Image Understanding, 63(3):542–567, May 1996.

[15] Ingemar J. Cox, Sunita Hingorani, Bruce M. Maggs, and Satish B. Rao. Stereo without disparity gradient smoothing: a Bayesian sensor fusion solution. In D. Hogg and R. Boyle, editors, Proceedings of British Machine Vision Conference, pages 337–346, Leeds, UK, September 1992.

[16] T. Day and J.-P. Muller. Digital elevation model production by stereo-matching SPOT image-pairs: a comparison of algorithms. Image and Vision Computing, 7(2):95–101, 1989.

[17] Umesh R. Dhond and J. K. Aggarwal. Structure from stereo—a review. IEEE Transactions on Systems, Man, and Cybernetics, 19(6):1489–1510, 1989.

[18] Boris Flach. On solvable structural recognition problems: The maxsum problem, 2002. Personal communication.

[19] Boris Flach and Michail I. Schlesinger. A class of solvable consistent labeling problems. In F. J. Ferri, J. M. Inesta, A. Amin, and P. Pudil, editors, Advances in Pattern Recognition, volume 1876, pages 652–658. 2000.

[20] Davi Geiger, Bruce Ladendorf, and Alan Yuille. Occlusions and binocular stereo. International Journal of Computer Vision, 14:211–226, 1995.

[21] Georgy Gimel’farb. Pros and cons of using ground control points to validate stereo and multiview terrain reconstruction. Presented at Evaluation and Validation of Computer Vision Algorithms, Schloss Dagstuhl, Wadern, Germany, March 1998.

[22] Georgy Gimel’farb and Hao Li. Probabilistic regularisation in symmetric dynamic programming stereo. In Proceedings of Image and Vision Computing New Zealand 2000, pages 144–149, November 2000.


[23] Minglun Gong and Yee-Hong Yang. Multi-resolution stereo matching using genetic algorithm. In Proceedings of Workshop on Stereo and Multi-Baseline Vision, pages 21–29, Kauai, Hawaii, December 2001.

[24] W. Eric L. Grimson. Computational experiments with a feature based stereo algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(1):17–34, 1985.

[25] Dan Gusfield and Robert W. Irving. The Stable Marriage Problem: Structure and Algorithms. The MIT Press, 1989.

[26] Chris Harris and Mike Stephens. A combined corner and edge detector. In M. M. Matthews, editor, Proceedings of the 4th ALVEY vision conference, pages 147–151, University of Manchester, England, September 1988.

[27] Richard I. Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, UK, 2000.

[28] Heiko Hirschmüller. Improvements in real-time correlation-based stereo vision. In Proceedings of Workshop on Stereo and Multi-Baseline Vision, pages 141–148, Kauai, Hawaii, December 2001.

[29] Yuan C. Hsieh, David M. McKeown, Jr., and Frederic P. Perlant. Performance evaluation of scene registration and stereo matching for cartographic feature extraction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):214–237, February 1992.

[30] Hiroshi Ishikawa and Davi Geiger. Occlusions, discontinuities, and epipolar lines in stereo. In Proceedings of the 5th European Conference on Computer Vision, Freiburg, Germany, June 1998.

[31] Bela Julesz. Towards the automation of binocular depth perception (AUTOMAP-1). In IFIPS Congress, Munich, Germany, 1962.

[32] Takeo Kanade and Masatoshi Okutomi. A stereo matching algorithm with an adaptive window: Theory and experiment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9):920–932, September 1994.

[33] Vladimir Kolmogorov and Ramin Zabih. Computing visual correspondence with occlusions using graph cuts. In Proceedings of the 8th International Conference on Computer Vision, Vancouver, Canada, July 2001.

[34] Vladimir Kolmogorov and Ramin Zabih. Computing visual correspondence with occlusions via graph cuts. Technical Report TR2001-1838, Computer Science Department, Cornell University, Ithaca, NY 14853, USA, 2001.


[35] Vladimir Kolmogorov and Ramin Zabih. Multi-camera scene reconstruction via graph cuts. In Proceedings 7th European Conference on Computer Vision, volume 3, pages 82–96, Copenhagen, Denmark, May 2002.

[36] Vladimir Kolmogorov and Ramin Zabih. What energy functions can be minimized via graph cuts? In Proceedings 7th European Conference on Computer Vision, volume 3, pages 65–81, Copenhagen, Denmark, May 2002.

[37] Andreas Koschan. Methodic evaluation of stereo algorithms. In R. Klette and W. G. Kropatsch, editors, Proc. 5th Workshop on Theoretical Foundations of Computer Vision, volume 69 of Mathematical Research, pages 155–166, Berlin, Germany, 1992.

[38] Jana Kostková, Jan Čech, and Radim Šára. Stereo algorithm evaluation methodology. Research report, Center for Machine Perception, K333 FEE, Czech Technical University, 2002. In preparation.

[39] Jana Kostková and Radim Šára. Disparity components matching revisited. Research Report CTU–CMP–2002–08, Center for Machine Perception, K333 FEE, Czech Technical University, Prague, Czech Republic, March 2002.

[40] Jana Kostková and Radim Šára. Stable matching based on disparity components. In H. Wildenauer and W. Kropatsch, editors, Proceedings of the Computer Vision Winter Workshop 2002, pages 140–148, Wien, Austria, February 2002. PRIP.

[41] Yvan G. Leclerc, Q.-Tuan Luong, and P. Fua. Measuring the self-consistency of stereo algorithms. In David Vernon, editor, Proceedings 6th European Conference on Computer Vision, volume 2, pages 282–298, Dublin, Ireland, June 2000.

[42] D. Marr and E. Hildreth. Theory of edge detection. Proceedings Royal Society London, B 207:187–217, 1980.

[43] D. Marr and T. Poggio. A theory of human stereo vision. A.I. Memo 451, Artificial Intelligence Lab, Massachusetts Institute of Technology, November 1977.

[44] Jiří Matas, Ondřej Chum, Martin Urban, and Tomáš Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of British Machine Vision Conference, September 2002. To appear.

[45] H. P. Moravec. Towards automatic visual obstacle avoidance. In Proceedings of 5th International Joint Conference on Artificial Intelligence, page 584, 1977.

[46] Jane Mulligan, Volkan Isler, and Kostas Daniilidis. Performance evaluation of stereo for tele-presence. In Proc. of International Conference on Computer Vision, 2001.

[47] Yuichi Ohta and Takeo Kanade. Stereo by intra- and inter-scanline search using dynamic programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(2):139–154, 1985.


[48] Masatoshi Okutomi and Takeo Kanade. A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(4):353–363, 1993.

[49] Masatoshi Okutomi, Yasuhiro Katayama, and Setsuko Oka. A simple stereo algorithm to recover precise object boundaries and smooth surfaces. In Proceedings of Workshop on Stereo and Multi-Baseline Vision, pages 158–165, Kauai, Hawaii, December 2001.

[50] Stephen B. Pollard, John E. W. Mayhew, and John P. Frisby. PMF: A stereo correspondence algorithm using a disparity gradient limit. Perception, 14:449–470, 1985.

[51] Sébastien Roy. Stereo without epipolar lines: A maximum-flow formulation. International Journal of Computer Vision, 34(2/3):147–161, 1999.

[52] Sébastien Roy and Ingemar J. Cox. A maximum-flow formulation of the n-camera stereo correspondence problem. In The 6th International Conference on Computer Vision, 1998.

[53] Radim Šára. The class of stable matchings for computational stereo. Research Report CTU-CMP-1999-22, Center for Machine Perception, Czech Technical University, Prague, Czech Republic, November 1999.

[54] Radim Šára. Sigma-delta stable matching for computational stereopsis. Research Report CTU–CMP–2001–25, Center for Machine Perception, Czech Technical University, Prague, Czech Republic, September 2001.

[55] Radim Šára. Finding the largest unambiguous component of stereo matching. In Proceedings 7th European Conference on Computer Vision, volume 3, pages 900–914, Copenhagen, Denmark, May 2002.

[56] Radim Šára and Ruzena Bajcsy. On occluding contour artifacts in stereo vision. In Deborah Plummer and Ian Torwick, editors, Proceedings of the International Conference on Computer Vision and Pattern Recognition, pages 852–857, Los Alamitos, California, June 1997.

[57] Kiyohide Satoh and Yuichi Ohta. Occlusion detectable stereo using a camera matrix. In Proc. 2nd Asian Conf. on Computer Vision, volume 2, pages 331–335, 1995. Available also from http://image-gw.esys.tsukuba.ac.jp/research/SEA/sea.html.

[58] Kiyohide Satoh and Yuichi Ohta. Occlusion detectable stereo — systematic comparison of detection algorithms. In Proceedings of International Conference on Pattern Recognition, 1996.

[59] Daniel Scharstein and Richard Szeliski. Stereo matching with nonlinear diffusion. International Journal of Computer Vision, 28(2):155–174, 1998.


[60] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Technical Report MSR-TR-2001-81, Microsoft Corporation, Redmond, WA 98052, USA, November 2001. The evaluation page URL: http://www.middlebury.edu/stereo.

[61] Daniel Scharstein, Richard Szeliski, and Ramin Zabih. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In Proceedings of Workshop on Stereo and Multi-Baseline Vision, pages 131–140, Kauai, Hawaii, December 2001.

[62] Milan Sonka, Václav Hlaváč, and Roger D. Boyle. Image Processing, Analysis and Machine Vision. PWS, Boston, USA, second edition, 1998.

[63] Jian Sun, Heung-Yeung Shum, and Nan-Ning Zheng. Stereo matching using belief propagation. In Proceedings 7th European Conference on Computer Vision, volume 2, pages 510–524, Copenhagen, Denmark, May 2002.

[64] Richard Szeliski. Prediction error as a quality metric for motion and stereo. In Proc. 7th IEEE International Conference on Computer Vision, volume 2, pages 781–788, Los Alamitos, CA, September 1999.

[65] Richard Szeliski and Daniel Scharstein. Symmetric sub-pixel stereo matching. In Proceedings 7th European Conference on Computer Vision, volume 2, pages 525–540, Copenhagen, Denmark, May 2002.

[66] Richard Szeliski and Ramin Zabih. An experimental comparison of stereo algorithms. In Proc. Vision Algorithms: Theory and Practice Workshop, Greece, September 1999.

[67] Hai Tao, Harpreet S. Sawhney, and Rakesh Kumar. A global matching framework for stereo computation. In Proceedings 8th International Conference on Computer Vision, volume 1, pages 532–539, Vancouver, Canada, July 2001.

[68] Tinne Tuytelaars and Luc Van Gool. Wide baseline stereo matching based on local, affinely invariant regions. In Majid Mirmehdi and Barry Thomas, editors, Proceedings British Machine Vision Conference, volume 2, pages 412–425, University of Bristol, September 2000.

[69] Olga Veksler. Stereo matching by compact windows via minimum ratio cycle. In Proc. 8th International Conference on Computer Vision, volume 1, pages 540–547, Vancouver, Canada, July 2001.

[70] Ye Zhang and Chandra Kambhamettu. Stereo matching with segmentation-based cooperation. In Proceedings 7th European Conference on Computer Vision, volume 2, pages 556–571, Copenhagen, Denmark, May 2002.

[71] C. Lawrence Zitnick and Takeo Kanade. A cooperative algorithm for stereo matching and occlusion detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7):675–684, July 2000.
