

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 8, NO. 6, JUNE 1999 785

Stereoscopic Ranging by Matching Image Modulations

Tieh-Yuh (Terry) Chen, Member, IEEE, Alan Conrad Bovik, Fellow, IEEE, and Lawrence K. Cormack

Abstract—We apply an AM–FM surface albedo model to analyze the projection of surface patterns viewed through a binocular camera system. This is used to support the use of modulation-based stereo matching, where local image phase is used to compute stereo disparities. Local image phase is an advantageous feature for image matching, since the problem of computing disparities reduces to identifying local phase shifts between the stereoscopic image data. Local phase shifts, however, are problematic at high frequencies due to phase wrapping when disparities exceed ±π. We meld powerful multichannel Gabor image demodulation techniques for multiscale (coarse-to-fine) computation of local image phase with a disparity channel model for depth computation. The resulting framework unifies phase-based matching approaches with AM–FM surface/image models. We demonstrate the concepts in a stereo algorithm that generates a dense, accurate disparity map without the problems associated with phase wrapping.

I. INTRODUCTION

IN RECENT years there has been a growing interest in the utilization of range images to map scenes arising in industrial, medical, military, and other applications. Computational stereopsis, where disparity is computed from two or more cameras with displaced image planes and then converted to range via triangulation, is a powerful means for the passive acquisition of range information. Passive sensing has the advantages of requiring less sensor sophistication and, in relevant applications, reduced sensor detectability. Stereoscopic ranging also is simple in concept and implementation. Further, very successful biological stereo ranging systems are available for study and emulation. Virtually identical stereo algorithms have evolved independently across classes (mammals and birds) and perhaps species (carnivores and primates), indicating that there may be a single, best stereo algorithm that the process of evolution inevitably uncovers [1].

Computational stereopsis algorithms can generally be broken into three processing steps [2]. First is image preprocessing, where the stereo pair is processed such that the second stage (matching) is made easier, either by enhancing features to be matched or by decomposing the images into subimages to be separately matched. In the second stage, matching features from each image are detected (either at sparse spatial locations [3]–[10], or at all the image points [11]–[13]), and it is decided which points in one image match to which points from the other. This solves the "correspondence problem," the most difficult aspect of the stereo ranging problem. The means for solving it is what differentiates the many algorithms that have been proposed. The relative shift between the coordinates of the matched points is the disparity, the computation of which depends on the geometry of the cameras. Finally, the third step involves calculating range from disparity using the known camera geometries.

Manuscript received March 13, 1997; revised August 21, 1998. This work was supported in part by the Air Force Office of Scientific Research, Air Force Systems Command, USAF, under Grant F49620-93-1-0307, and by the Army Research Office Contract DAAH 049510494. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Goesta H. Granlund.

T.-Y. Chen is with the Institute of Science and Technology, Taiwan, R.O.C.

A. C. Bovik is with the Center for Vision and Image Sciences, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712-1084 USA (e-mail: [email protected]).

L. K. Cormack is with the Center for Vision and Image Sciences, Department of Psychology, The University of Texas at Austin, Austin, TX 78712-1084 USA.

Publisher Item Identifier S 1057-7149(99)04053-1.

An important class of approaches that deliver dense depth results involves the use of measured image phase as matching features [14]–[16], [42]–[45]. Sanger [14] uses a bank of one-dimensional (1-D) Gabor filters (aligned along the epipoles) to decompose the stereo image data into different scale subimages. Local instantaneous frequencies are approximated by the center frequencies of the Gabor filters (introducing an error). Pointwise phase differences are then computed between the stereoscopic images (without the need for matching). Disparities from each channel are computed from the phase differences and averaged over spatial scales to give a final result. The approach has advantages: search-free correspondence computation and, hence, depth computation without solving an intense optimization problem. Sanger's algorithm has some limitations, owing to the coarse approximations used, and it produces oversmoothed disparity maps.
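Sanger's per-channel scheme reduces to a few lines. The sketch below is a hypothetical illustration, not the published implementation: per-channel disparities are obtained by dividing wrapped phase differences by each channel's center frequency (the coarse approximation that introduces the error noted above), then averaging over scales; all names and shapes are our own choices.

```python
import numpy as np

def sanger_disparity(phase_left, phase_right, center_freqs):
    """Average per-channel phase-difference disparities (sketch of [14]).

    phase_left, phase_right: arrays of shape (n_channels, n_pixels) holding
    local phase measured in each 1-D Gabor channel along an epipolar line.
    center_freqs: per-channel center frequencies (rad/pixel), used in place
    of the true instantaneous frequency -- the approximation Sanger makes.
    """
    dphi = phase_left - phase_right
    # wrap phase differences into [-pi, pi) before converting to disparity
    dphi = np.mod(dphi + np.pi, 2 * np.pi) - np.pi
    disparities = dphi / np.asarray(center_freqs)[:, None]
    return disparities.mean(axis=0)  # average over spatial scales
```

Note that the division by a fixed center frequency is exactly what limits accuracy when the local instantaneous frequency deviates from the filter's tuning.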

Jepson and Jenkin [15] also proposed a simple scheme for rapidly computing disparities from the output phase differences of 1-D Gabor filters. They formulated a multiscale disparity computation, rather than simply averaging over channels. Their first approach also used coarse phase approximations, failing to recover disparities densely, e.g., in simple random-dot stereograms. However, in a later effort [44], they utilized auxiliary methods for inferring local surface structures, which produced denser and more detailed results.

Langley et al. [16] developed an algorithm using the responses of both 1-D and two-dimensional (2-D) Gabor filters. Disparities were estimated by solving a first-order differential equation via a Newton–Raphson root-finding iteration. The approach suffered from problems related to unstable phase wrapping in the computation of local image phase; this was ameliorated to some degree by approximating the instantaneous phase to assist in the disparity computations.

1057–7149/99$10.00 © 1999 IEEE



In a deeper study of the problem of computing phase-based disparities, Fleet et al. [42], [43] examined the stability of stereoscopic phase computations, and the role of phase discontinuities, which are found to create errors (see also [26]). They proposed methods to ameliorate these effects by detecting phase discontinuities and using surrounding phase information to produce better disparity estimates. There has also been increasing recognition in the psychophysics community of the probable significance of phase information in biological visual processing. For example, Fleet et al. [45] propose a model of phase-based disparity processing in human visual cortex. They view phase shift information as being pooled across the responses of neurons tuned to multiple orientations and radial spatial frequencies, producing a stable, unambiguous disparity representation.

The view taken in this paper is that the use of local image phase to compute disparities in stereoscopic images (or motion images, for that matter), as advanced by these researchers, is a powerful concept that has the potential, once fully developed, to free current matching-based image analysis problems from a dependence on solving difficult search-based correspondence calculations. However, phase-based approaches have lacked a clear connection to the scene that is projected into the image being processed.

It is a goal of this paper to better develop this connection. We employ a recently introduced and very natural image model, the AM–FM image model, wherein image information is captured by the amplitude and frequency variations of modulated 2-D sinusoidal functions. The model is quite successful for analyzing nonstationary, locally coherent image structures [17]. A multichannel Gabor filterbank is used to make localized measurements of the modulating functions, and to separate distinct modulating functions. Accurate multiscale measurements are made of local image phase and local image contrast. A more complete physical meaning for these measurements is developed by using the AM–FM model within the context of surface-to-image projection: image phase is directly related to the surface patterns. Given accurate measurements of local image phase, stereo disparities are computed from phase differences. We propose a new model for disparity computation utilizing the concept of disparity channels, which admits a systematic approach for coarse-to-fine disparity computation that melds seamlessly with phase disparity measurements made under the multiscale AM–FM image model. Through these developments, we create a new stereo algorithm that recovers disparity at every pixel and that preserves information near depth discontinuities. Intensity discontinuities (which typically accompany depth discontinuities) are well handled by the AM–FM image model, where they are represented as sudden phase shifts.

The paper is organized as follows. Section II describes a powerful AM–FM image model that is derived from the surface albedo and a stereoscopic imaging geometry. Section III considers potential modulation-based matching features. Among these, the local image phase is deemed most useful. Section IV develops a coarse-to-fine matching strategy using a model of disparity channels designed in concert with the Gabor filterbank. Simulation results appear in Section V. The paper concludes in Section VI.

II. FROM SURFACE MODULATION TO IMAGE MODULATION

We first describe the image modulation model that will be used in the development and analysis of a phase-based approach to computed stereopsis. The model uses a simple projective geometry applied to a model of modulated surface reflectivity. A simple, static nonconvergent stereo imaging geometry is used to demonstrate principle; however, the same approach may be applied (with modifications) to vergent or active stereo imaging geometries.

Most of the mathematical developments are cast in a continuous-domain setting for simplicity. Discretization will be considered in the context of the algorithm. A sensed 2-D optical image will be modeled by general multicomponent amplitude- and frequency-modulated (AM–FM) complex functions of the form $t(\mathbf{x})$, indexed by 2-D coordinates $\mathbf{x} = (x_1, x_2)$, as follows:

$$t(\mathbf{x}) = \sum_{k=1}^{K} a_k(\mathbf{x})\, e^{j\varphi_k(\mathbf{x})} \tag{1}$$

where the amplitude modulation (contrast) functions $a_k(\mathbf{x}) \ge 0$ are the local peak-to-peak contrast of each component, and the frequency modulation functions (phases) $\varphi_k(\mathbf{x})$ capture locally significant frequency variations over diverse orientations and granulations. Optical images have real, positive intensity distributions, whereas (1) is complex. As explained in Section II-E, (1) can be regarded as the result of preprocessing a real, positive image function.
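A minimal synthetic instance of the model (1) can make the notation concrete. The sketch below (all parameter values are illustrative, not taken from the paper) builds a single-component AM–FM image with a smooth Gaussian amplitude envelope and a chirp phase whose gradient sweeps across the frequency plane:

```python
import numpy as np

# A single-component instance of model (1): t(x) = a(x) * exp(j*phi(x)),
# with a Gaussian AM envelope and a quadratic (chirp) FM phase.
N = 128
x1, x2 = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")

a = np.exp(-((x1 - N / 2) ** 2 + (x2 - N / 2) ** 2) / (2 * 40.0 ** 2))  # AM
phi = 0.002 * x1 ** 2 + 0.25 * x2                                       # FM phase
t = a * np.exp(1j * phi)

# The instantaneous frequency is the phase gradient; for this phi it is
# (0.004 * x1, 0.25), i.e., frequency increases along the first axis.
inst_freq_1 = np.gradient(phi, axis=0)
```

Taking `np.abs(t)` recovers the AM function and `np.angle(t)` the (wrapped) phase, which is the quantity the stereo algorithm will match.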

AM–FM models are becoming increasingly popular as analysis tools for a variety of signal and image processing tasks. As early as 1977, Moorer [20] used 1-D models similar to (1) to analyze and synthesize computer music. More recently, 1-D single-component models have been studied [21], [22] and applied to speech signal analysis [23]. Two-dimensional models of image modulation were introduced and systematically developed by Bovik et al. in the context of image segmentation [24], [25], nonstationary image analysis [17], [26], [27], and shape-from-texture [28]. Multicomponent AM–FM image models were first proposed in [26] and have recently been deeply analyzed [29], [30]. These studies have shown that the model (1) can effectively capture physically meaningful signal structures that arise from natural and artificially occurring processes. Moreover, the concept of local frequency and amplitude variation, which presupposes a specific class of modulated image structure, is quite direct and intuitive as opposed to, for example, very general statistical models, such as Markov random fields, that are described in terms of unspecified local interactions between samples. Recent work in multicomponent AM–FM modeling [29] has shown that very general images can be represented with good visual accuracy by a small sum of AM–FM functions, and that the essential structures of realistic image patterns can usually be captured with very few such components [29], [31]. The main restriction is that the AM–FM functions be regular in the sense that the modulating functions be smooth; more specifically, that they have small Sobolev norms [17], [22], [26], [29]–[31].



In this paper, the goal is primarily to model image "emergent" structures, viz., those that dominate the local perception of an image, but that may also be nonhomogeneous in that the local spectrum varies. Various tools seek to capture spectral nonstationarities, e.g., short-time Fourier transforms (STFT's), time-frequency distributions, and wavelet expansions [32]; however, the model (1) is quite well suited for modeling them. Nonstationary image structures may arise in a number of ways, including optical (geometric) distortion, but we are mostly interested in the following three types of physical processes. 1) Irregular, yet locally coherent patterns or subpatterns in surface reflectance, such as the markings on an animal, in wood, or in woven fabrics. 2) Variations in surface topography lead to apparent changes in pattern density and pose, even if the surface pattern is intrinsically regular. Smoothly-varying shapes project stationary surface patterns onto nonstationary image patterns. The division between the processes 1) and 2) is ambiguous, and depends to a great extent on the scale of observation and on the surface smoothness (or lack thereof). In 3), the process of imaging, or perspective projection, distorts patterns through foreshortening. All of these factors provide information that can be used to disambiguate image intensities between stereoscopic image frames.

A. Surface Reflectivity Model

Underlying the image modulation model is a similar surface reflectance model, whereby the surface reflectivity or albedo is assumed to take the form

$$r(\mathbf{u}) = b(\mathbf{u})\left[1 + \sum_{k=1}^{K} c_k(\mathbf{u}) \cos \chi_k(\mathbf{u})\right] \tag{2}$$

where $\mathbf{u} = (u_1, u_2)$ are local surface coordinates at a macroscopic scale of observation, $b$ is the mean albedo, and the $c_k$ and $\chi_k$ are surface contrast and phase modulation functions. Models of the form (1) and (2) are related to recently emerging models of natural pattern formation. For example, partial differential equations that are used to describe naturally-forming surface patterns often have solutions that have coherent AM–FM forms. These include the Cross–Newell phase diffusion equations [47]–[49] and certain reaction-diffusion equations [50].

Note that

$$b(\mathbf{u})\left[1 - \sum_{k=1}^{K} c_k(\mathbf{u})\right] \le r(\mathbf{u}) \le b(\mathbf{u})\left[1 + \sum_{k=1}^{K} c_k(\mathbf{u})\right]. \tag{3}$$

Since albedo is positive, we have the following additional restriction:

$$\sum_{k=1}^{K} c_k(\mathbf{u}) < 1. \tag{4}$$

This constraint can be obtained in another way. If $r_{\max}$ and $r_{\min}$ are the extremes of $r$ over an arbitrary surface patch $P$, the local contrast over $P$ is $C = (r_{\max} - r_{\min})/(r_{\max} + r_{\min})$, which must satisfy $0 \le C \le 1$. From (3), $C = \sum_k c_k$ when the cosines attain their extremes, which again yields the restriction (4).

The notion of a surface coordinate system on which an albedo function is mapped is a subtle concept that can only be loosely defined. In (2), it is assumed that the surface(s) being imaged have a smooth geometry at the scale of observation, allowing for a smooth interpretation of the modulating functions. For example, if a field of grass on a flat plot of earth is imaged from a distance, the surface coordinate system may be regarded as flat, with the fine scale surface texture (the blades of grass and their shadows) being contained in the modulating functions. Of course, the local coordinate system will not necessarily be flat. However, the surface-to-image projection model that we will use assumes a local planarity on the surface, in the sense that a local coordinate system is defined on the surface tangent plane at every surface point. Since both the albedo model and the processing are highly local, this is not very restrictive.

B. Relating Surface Coordinates to World Coordinates

Any surface lies in the three-dimensional (3-D) space of world coordinates $(x, y, z)$. We will need to develop a relationship between world coordinates and local surface coordinates. The $z$-axis is usually defined to be orthogonal to the camera image planes, unless the cameras verge, in which case a nominal assumption is made. A visible surface patch can be expressed

$$z = F(x, y) \tag{5}$$

with surface normal

$$\mathbf{n} = \frac{(-F_x, -F_y, 1)}{\sqrt{1 + F_x^2 + F_y^2}}. \tag{6}$$

Assuming that $F$ is differentiable at a point $\mathbf{p}$, the surface orientation at $\mathbf{p}$ is described by the local slant $\sigma$ (the angle between the normal and the $z$-axis) and the tilt $\tau$ (the angle between the $x$-axis and the projection of the surface normal onto the $(x, y)$ plane).

The local surface coordinate system is defined similar to that used in the shape-from-texture work of Super and Bovik [28], [33]. At each surface point, the local coordinate system $(u_1, u_2)$ is defined so that the unit vectors in the $u_1$ and $u_2$ directions lie in the surface tangent plane, and are related to slant and tilt according to [33]:

$$\mathbf{u}_1 = (\cos\sigma\cos\tau,\ \cos\sigma\sin\tau,\ -\sin\sigma), \qquad \mathbf{u}_2 = (-\sin\tau,\ \cos\tau,\ 0). \tag{7}$$

C. Relating Surface Coordinates to Binocular Image Coordinates

We take the origins of the image coordinate systems $(x_L, y_L)$ and $(x_R, y_R)$ (Left and Right, respectively) to be at world coordinates $(\mp D, 0, -f)$, where $f$ is the camera focal length(s) and $2D$ is the baseline separation between the lens centers of the cameras.

As a demonstration of principle, we assume that the camera optical axes are parallel. The general ideas here can be applied to convergent systems, but with complications in the geometric development [34].

We adopt the standard upright pinhole or perspective projection model for image formation [35], whence it can be shown



Fig. 1. Binocular camera geometry.

that for $i = L$ (Left) and $i = R$ (Right) [33], [36]

(8)

The matrix

(9)

is a composition of the transformation from surface-to-world coordinates and projection onto the image plane. Fig. 1 depicts all of the relevant quantities.

The surface coordinates can be expressed in terms of image coordinates by inverting (8):

(10)

where

(11)

Given a surface coordinate and assuming the cameras are accurately positioned, its projections $(x_L, y_L)$ and $(x_R, y_R)$ are related by

$$x_L = x_R + \frac{2Df}{z} \tag{12}$$

and $y_L = y_R$. The epipolar line for such a nonconvergent geometry is simply the horizontal, along which a search may be made for matches. Once a match is determined, standard triangulation solves $z = 2Df/d$ from the known quantities $f$, $D$, and the measured disparity $d = x_L - x_R$.
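The triangulation step can be sketched directly from the geometry, assuming the standard parallel-axis relation $z = 2Df/d$ between depth, focal length $f$, and baseline $2D$ (the function and argument names are our own):

```python
def range_from_disparity(d, focal_length, half_baseline):
    """Triangulate depth z from horizontal disparity d = x_L - x_R for a
    nonconvergent (parallel-axis) stereo pair, via z = 2*D*f / d, where f
    is the focal length and 2*D is the baseline separation."""
    if d <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return 2.0 * half_baseline * focal_length / d
```

Note the inverse relation: range resolution degrades quadratically with depth, since a fixed disparity error Δd maps to a depth error proportional to z²Δd/(2Df).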

D. Mapping Surface Reflectivity to Image Intensity

The patterns that lie on a surface are deformed by the local surface topography (variation in slant and tilt) and by perspective projection. From (2), the irradiance of the sensed (left or right) image may be expressed ($i$ is meant to represent either $L$ or $R$)

$$f_i(\mathbf{x}) = e_i(\mathbf{x})\,\rho_i(\mathbf{x})\,b_i(\mathbf{x})\left[1 + \sum_{k=1}^{K} c_k^{(i)}(\mathbf{x}) \cos \chi_k^{(i)}(\mathbf{x})\right] \tag{13}$$

where the AM–FM functions in (13) are related to those at the surface by the transformations (10) and (11). In (13), $e_i$ is an illumination function, and $\rho_i$ is the reflectance function that depends on the local surface orientation [36]. Assume that $e_i$ and $\rho_i$ vary smoothly with $\mathbf{x}$, which will simplify later developments. This is reasonable in most instances, although not at gross intensity discontinuities.

E. Image Preprocessing

Since local image phase will be used to compute disparities, a mechanism for separating the modulated albedo functions from illumination and reflectance effects is used. A simple device for doing this is a logarithmic point operation [17]

$$\log f_i(\mathbf{x}) \approx \log\left[e_i(\mathbf{x})\,\rho_i(\mathbf{x})\,b_i(\mathbf{x})\right] + \sum_{k=1}^{K} c_k^{(i)}(\mathbf{x}) \cos \chi_k^{(i)}(\mathbf{x}) \tag{14}$$

using the approximation (or nonlinear transformation) $\log(1 + v) \approx v$ for small $|v|$, which is reasonable except for images with very high local contrasts, which are infrequent in nature. The separation in (14) is done since $\log[e_i\rho_i b_i]$ contains low-frequency information that is not well captured by the AM–FM model. Separate stereo processing of $\log[e_i\rho_i b_i]$ could yield additional three-dimensional (3-D) data; however,



the assumption of smoothness suggests that they are poormatching primitives.

In the next section, a method for isolating the phase responses is described, which involves passing the preprocessed image through a bank of narrowband frequency- and orientation-sensitive spatial filters $g_{m,n}$, indexed by scale $m$ and center frequency vector $\boldsymbol{\omega}_{m,n}$. The filters are selected so that, among other things, the response near DC is small, hence assume

$$\left(g_{m,n} * \log f_i\right)(\mathbf{x}) \approx \left(g_{m,n} * \sum_{k=1}^{K} c_k \cos \chi_k\right)(\mathbf{x}). \tag{15}$$

Since the filters $g_{m,n}$ are effectively one-sided in their frequency responses, it is true that

$$\left(g_{m,n} * \sum_{k=1}^{K} c_k \cos \chi_k\right)(\mathbf{x}) \approx \frac{1}{2}\left(g_{m,n} * t\right)(\mathbf{x}) \tag{16}$$

where $t = \sum_k c_k \cos\chi_k + j\,\mathcal{H}\!\left[\sum_k c_k \cos\chi_k\right]$ is the analytic image associated with each term, and $\mathcal{H}$ is a 2-D Hilbert transform defined in the positive horizontal direction [30], [51], [52]. The approximation (16) is the same as assuming that the Hilbert transform of the image was taken first, and then the filters applied; equality would hold exactly if the filters $G_{m,n}$ were strictly half-plane. The factor $\frac{1}{2}$ appears since the right-half frequency plane coefficients are not doubled prior to filtering; for convenience, this factor will be ignored (absorbed) in the sequel. The Hilbert transform interpretation removes any ambiguity in the definition of image phase, and in the definition of instantaneous frequency. In-depth discussions of the 2-D Hilbert transform and instantaneous frequencies are available in [30], [51], and [52]. With this, we may use the image model (1), which will be used throughout the rest of the paper as a convenient representation of the essential modulated structure of the input stereoscopic images.
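The preprocessing chain of Section II-E — the logarithmic point operation followed by a one-sided (analytic) representation along the positive horizontal direction — can be sketched as follows. Here the analytic image is formed with a row-wise FFT rather than an explicit Hilbert filter; that is an implementation choice of ours, equivalent for periodic signals:

```python
import numpy as np

def analytic_rows(x):
    """Row-wise analytic signal via the FFT: keep DC, double the positive
    horizontal frequencies, zero the negative ones (cf. (15) and (16))."""
    X = np.fft.fft(x, axis=1)
    n = x.shape[1]
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0          # Nyquist bin kept once for even lengths
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(X * h, axis=1)

def preprocess(image):
    """Logarithmic point operation followed by the analytic image."""
    log_im = np.log(np.asarray(image, dtype=float) + 1e-6)  # guard log(0)
    return analytic_rows(log_im)
```

The real part of the result is the log image itself; the imaginary part is its row-wise Hilbert transform, so the complex output carries an unambiguous local phase.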

III. DISPARITY FROM FILTERED RESPONSES

The model (1) contains an unspecified number of AM–FM components. Indeed, for any $K$, there are uncountably many possible selections of the AM–FM functions $c_k$ and $\chi_k$ so that (2) represents the surface pattern exactly. Small values of $K$ can usually be found such that (1) captures the image efficiently. The most interesting interpretations arise from components with large average amplitudes and smooth phase gradients. These are more likely to have arisen from meaningful physical processes, since they are "emergent" in the image, and they are easier to extract, particularly in the presence of noise [17], [31].

The degree of variation (deviation from linearity) of each component phase may change substantially over the image, so the instantaneous frequency vectors may traverse much of the frequency plane. Variations in the local surface frequencies, the surface topography, and the projection (14) all contribute to changes in the instantaneous frequencies. Therefore, it is necessary to introduce a device to extract local image modulations. Since the distribution of modulation energy will vary with the local frequencies, an effective way to capture and identify locally significant components is by a bank of spatially localized image filters that cover the frequency plane. Since the image may contain multiple components, the filters need to be spectrally localized to avoid cross-component interference. These familiar joint constraints are often used to support the choice of low-uncertainty filters such as the optimal Gabor functions [17], [24], [31], [37], [41], compactly supported wavelets [38], prolate functions [39], etc.

A. Definition of Filterbank

We will use multiple 2-D Gabor functions ($MN$ filters doubly-indexed by scale $m = 1, \ldots, M$ and orientation $n = 1, \ldots, N$) as the optimizing minimum-uncertainty choice. They have the general form

$$g_{m,n}(\mathbf{x}) = w_m(\mathbf{x})\, e^{j\,\boldsymbol{\omega}_{m,n}\cdot\mathbf{x}} \tag{17}$$

where $\boldsymbol{\omega}_{m,n}$ is the modulation center frequency and the $w_m$ are 2-D Gaussians

$$w_m(\mathbf{x}) = \frac{1}{2\pi\lambda\sigma_m^2} \exp\left\{-\frac{(x_1/\lambda)^2 + x_2^2}{2\sigma_m^2}\right\} \tag{18}$$

with scale parameter $\sigma_m$ (fixed by $m$) and (uniform) aspect ratio $\lambda$. The frequency response is

$$G_{m,n}(\boldsymbol{\omega}) = \exp\left\{-\frac{\sigma_m^2}{2}\left[\lambda^2\left(\omega_1 - \omega_1^{m,n}\right)^2 + \left(\omega_2 - \omega_2^{m,n}\right)^2\right]\right\} \tag{19}$$

which is bandpass with aspect ratio $1/\lambda$. The Gabor filters are parametrized by their radial center frequencies $|\boldsymbol{\omega}_{m,n}|$, by their filter orientations, and by their half-peak radial frequency bandwidth (in octaves).

Fig. 2 depicts a sampling of the spatial frequency domain by 12 Gabor filters (four scales in three orientations), with a fixed octave bandwidth over all scales. In the experiments given later, this filter sampling is used to capture horizontal local frequencies with a higher resolution than vertically oriented frequencies, since disparities will only be computed along horizontal scanlines.
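A filterbank of this kind is straightforward to construct. The sketch below builds 12 complex Gabor kernels (four octave-spaced scales, three orientations) following the general form above; the specific frequencies, σ values, orientations, and kernel size are placeholders of ours, since the paper's exact settings did not survive extraction:

```python
import numpy as np

def gabor_kernel(radial_freq, theta, sigma, aspect=1.0, size=31):
    """Complex 2-D Gabor g(x) = w(x) * exp(j * omega . x), cf. (17)-(18).
    radial_freq (rad/pixel) and theta set the center frequency vector;
    sigma sets the Gaussian scale, aspect the (uniform) aspect ratio."""
    half = size // 2
    x1, x2 = np.meshgrid(np.arange(-half, half + 1),
                         np.arange(-half, half + 1), indexing="ij")
    w = np.exp(-((x1 / aspect) ** 2 + x2 ** 2) / (2 * sigma ** 2))
    w /= w.sum()  # unit-sum Gaussian window
    omega1 = radial_freq * np.cos(theta)
    omega2 = radial_freq * np.sin(theta)
    return w * np.exp(1j * (omega1 * x1 + omega2 * x2))

# four octave-spaced scales, three orientations (placeholder values);
# sigma grows as frequency falls, giving a constant-octave bank
bank = [gabor_kernel(f, th, sigma=2.0 / f)
        for f in (np.pi / 2, np.pi / 4, np.pi / 8, np.pi / 16)
        for th in (-np.pi / 4, 0.0, np.pi / 4)]
```

Tying σ to 1/frequency is what makes the bank self-similar across scales, matching the octave-bandwidth parametrization described above.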

The Gabor filters form a set of spatial frequency channels, or s-channels, through which images are decomposed. These are to be distinguished from disparity channels, or d-channels, which will be tuned to specific stereo disparities rather than specific spatial frequencies.

B. Computing Local Image Phase from a Single Component

Obtaining a closed-form expression for the response of a linear s-channel to an arbitrary AM–FM function is generally difficult or impossible. However, under certain assumptions, the response of the s-channel filter $g_{m,n}$ indexed to component $k$ of (1)

$$y_{m,n,k}(\mathbf{x}) = \left(t_k * g_{m,n}\right)(\mathbf{x}) \tag{20}$$



Fig. 2. Intensity plot of 2-D Gabor frequency responses.

can be closely approximated by

$$y_{m,n,k}(\mathbf{x}) \approx t_k(\mathbf{x})\, G_{m,n}\!\left(\nabla\varphi_k(\mathbf{x})\right) = a_k(\mathbf{x})\, e^{j\varphi_k(\mathbf{x})}\, G_{m,n}\!\left(\nabla\varphi_k(\mathbf{x})\right). \tag{21}$$

The approximation (21), termed the quasieigenfunction approximation (QEA), has been utilized in numerous studies involving filtering of AM–FM signals and images [17], [22], [29], [31]. The absolute error in (21) relative to (20) is tightly bounded by functionals expressing both the smoothness of $a_k$ and $\varphi_k$ (Sobolev norms) and the spatial durations (energy variances) of the s-channel filters $g_{m,n}$. The theory behind these approximations has been explored in detail elsewhere [17], [22], [29], [31]. In the current context, the theory applies well since the s-channel filters are low-uncertainty and the modulating functions are assumed smooth.

Using (21), the output amplitude responses can be approximated by

(22)

while the phase responses have the simple approximation

(23)

Direct computation of the phase using (23) yields a wrapped phase, with distances between phase discontinuities depending on the spatial distribution of instantaneous frequencies. When the instantaneous frequencies are smooth, the phase wrapping period can be locally approximated by

(24)
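The wrapped-phase behavior described by (23) and (24) is easy to verify numerically. The 1-D sketch below uses an assumed idealized channel output (not the paper's 2-D filters) to show that wrap discontinuities recur with period equal to 2π divided by the instantaneous frequency.

```python
import numpy as np

# 1-D illustration: the phase of an idealized complex channel response
# exp(j*2*pi*f*x), computed via np.angle, is wrapped into (-pi, pi]; the
# wrap discontinuities recur every 1/f samples, i.e. with period 2*pi
# divided by the instantaneous frequency, matching (24).
f = 0.05                                   # cycles/sample (assumed)
x = np.arange(200)
response = np.exp(2j * np.pi * f * x)      # idealized s-channel output
phase = np.angle(response)                 # wrapped local phase

jumps = np.where(np.abs(np.diff(phase)) > np.pi)[0]   # wrap locations
wrap_period = np.diff(jumps).mean()                   # ~ 1/f = 20 samples

unwrapped = np.unwrap(phase)               # recovers 2*pi*f*x up to 2*pi*k
```

Note that `np.unwrap` resolves the ambiguity only for smooth 1-D phase; the coarse-to-fine mechanism of Section IV addresses the same ambiguity in the stereo setting.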

In (23), the local phase of a component is extracted independent of the s-channel being used. In practice, the QEA approximations hold best for s-channels having (relatively) strong responses, since the AM–FM signal energy is strong in those channels. An s-channel with a weak response at some coordinate likely is not responding strongly to an AM–FM component; hence its response may be composed primarily of multiple (out-of-band) components or of noise. This strategy is used later when selecting the s-channels to be used in the disparity computations.

C. Disparity from Phase Difference

Now we determine disparities from the s-channel responses. The aggregate response of an s-channel to the left image is

(25)

where the remaining terms are defined analogously to (16). By the QEA approximation (21), we write

(26)

and likewise the approximation for the right image:

(27)

We assume that the response of each s-channel is dominated by at most one AM–FM component. Then, (26) and (27) become

(28)

(29)

This assumption is not very restrictive, since we are not interested in separating and identifying the multiple components. If multiple components fall within the passband at some coordinate, they may be merged into a single component. Applying (22) and (23) to the merged component yields a larger AM estimate, but there will be little effect on the FM estimate.

Two merged components (dropping the dependence on spatial coordinate) have an approximate magnitude and an approximate phase, each with a bounded error that we may regard as negligible when the frequency separation is small, viz., when both components lie within the same passband. Hence, we may regard multiple components within a passband as approximated by a single component, which may be used to compute an aggregate phase disparity between channels. Merged AM–FM components typically represent instances where two (or more) otherwise globally distinct AM–FM components become indistinct at some spatial location.
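A quick numerical check of this merging argument: two phasors with a small phase separation, as when both components lie in one passband, sum to a single phasor whose magnitude is close to the sum of amplitudes and whose phase lies between the component phases. The amplitudes and phases below are arbitrary illustrative values.

```python
import numpy as np

# Two in-band components with nearby phases merge into a single phasor.
a1, p1 = 1.0, 0.40          # component 1: amplitude, phase (assumed)
a2, p2 = 0.6, 0.55          # component 2: small phase offset, same passband
merged = a1 * np.exp(1j * p1) + a2 * np.exp(1j * p2)

mag_err = abs(abs(merged) - (a1 + a2))   # magnitude error is second-order
phase = np.angle(merged)                 # aggregate phase, between p1 and p2
```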

There are three simple modulation-based functions which could be used to compute disparity: the AM estimate, the FM estimate, and the instantaneous frequencies. We will examine each separately. From (22) and (23), the magnitude and phase


CHEN et al.: STEREOSCOPIC RANGING 791

responses of the s-channel (applied to the left image) can be computed with the approximate values

(30)

(31)

1) Stereo from AM Matching: The AM function is estimated by dividing the s-channel magnitude response by the filter frequency response magnitude evaluated at the estimated instantaneous frequency,1 as follows:

(32)

The goal of stereo disambiguation is then to find the disparity function such that the left and right AM functions match:

(33)

Of course, it is not generally possible to find a disparity function such that (33) is everywhere satisfied; however, the difference should be small. Unfortunately, the difference might be small for many definitions of the disparity at most locations, since the amplitude often will vary only slightly over substantial image regions. For example, in the case of a fronto-parallel plane with a periodic texture overlaid on it, the extracted AM–FM components will be nearly sinusoidal, with constant amplitudes. At best, solving (33) should be regarded as an ancillary constraint; in fact, this is done later here. However, it should be observed that the amplitudes may correlate imperfectly with the phases due to various surface distortions, an observation that was also made by Fleet and Jepson [43].

2) Stereo from Instantaneous Frequency Matching: For the same reason, the instantaneous frequencies are also ambiguous. Solving

(34)

will also fail over regions that are locally homogeneous in frequency. As in (33), severe, unresolvable ambiguities will arise if the instantaneous frequency is used as the primary disambiguation primitive.

3) Stereo from Phase Matching: The s-channel phases, which are also the FM functions, provide ample information for disambiguation. Attempting to solve

(35)

over the disparity function is a much better-posed problem, since unresolvable ambiguities will not often arise. Exceptional instances occur where there is no intensity variation over a region, so the dominant AM–FM components are near-constant with zero phases. We are not aware of any approach to stereo disambiguation that claims to resolve this situation, other than through some kind of interpolation.

Thus, the local s-channel phase responses, or FM functions, have the best potential for computing disparities, with local measurements of AM or instantaneous frequencies being useful as secondary pieces of information. Therefore, we

1The gradient of phase may be estimated by discrete approximation to derivatives, by ratioing filter outputs [46], or through standard AM–FM demodulation operators, such as the Teager operator, that yield the gradient of phase [27].

Fig. 3. Model of the disparity channel mechanism.

may regard solving (35) as the fundamental problem to be addressed in the remainder of the paper.

One simple approach is to observe that, by Taylor’s theorem,

(36)

where the last term is the first-order Taylor remainder [40]. If the phase function is assumed to vary smoothly, then a simple equation for the disparity can be obtained from a single s-channel [16]:

(37)

Of course, tracking solutions to (37) would require switching s-channels over space.

However, (36) is ineffective since (37) fails wherever the instantaneous frequencies change tangibly. Reasonable results might be had only from observations at a very small scale, viz., using filters having a very small effective spatial support. However, image modulations occur over a wide range of scales; indeed, the model is intrinsically and naturally multiscale. For this reason, we next propose and motivate a multiscale, coarse-to-fine mechanism for disambiguating stereo disparities from measurements of the s-channel phases.
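To see both the mechanism and the wrap limit of (37), the 1-D sketch below estimates disparity from the wrapped phase difference of a single complex Gabor channel, dividing by the channel center frequency as a stand-in for the phase derivative. The filter design and the test signal are illustrative assumptions.

```python
import numpy as np

def gabor1d(f0, sigma, size=65):
    """Complex 1-D Gabor kernel (assumed design for this sketch)."""
    x = np.arange(size) - size // 2
    return np.exp(-0.5 * (x / sigma)**2) * np.exp(2j * np.pi * f0 * x)

f0, sigma, shift = 0.06, 8.0, 3.0              # assumed channel, true disparity
x = np.arange(256)
left = np.cos(2 * np.pi * f0 * x)              # left scanline
right = np.cos(2 * np.pi * f0 * (x - shift))   # right scanline, shifted

g = gabor1d(f0, sigma)
pL = np.angle(np.convolve(left, g, mode='same'))
pR = np.angle(np.convolve(right, g, mode='same'))

# Wrapped phase difference, divided by the center frequency as a stand-in
# for the phase derivative in (37). Valid only while the true shift stays
# below the wrap limit of 1/(2*f0) samples.
dphi = np.angle(np.exp(1j * (pL - pR)))
d_est = dphi / (2 * np.pi * f0)
```

With `shift = 3` and `f0 = 0.06`, the estimate is accurate; pushing `shift` past 1/(2·f0) ≈ 8.3 samples would wrap the phase difference and alias the estimate, which is the failure the multiscale mechanism below is designed to avoid.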

IV. A DISPARITY CHANNEL MODEL

Here, we explore a natural, multiscale method for solving the stereo problem (35) using the s-channel filter phase responses. The development uses the concept of a disparity channel, or d-channel, which enables a search-free correspondence mechanism for extracting disparities and which melds seamlessly with the s-channel phase-based disparity model. The approach extends prior methods of multiscale disparity calculation, e.g., [3], [6], [14]–[16], [42], [44].

A. Disparity Channels

For each s-channel with radial frequency F_r and orientation θ_s, define a finite set of d-channels Ch_{r,s}, indexed by candidate disparity, by computing threshold responses from the s-channels (with notation defined below): the threshold response equals the d-channel intensity when the phase difference at the disparity shift is sufficiently small, and is zero otherwise.

(38)

The maximum of these threshold responses is used to select the particular disparity channel that is used at each coordinate, as described below. In (38), the parameter is a predefined disparity



Fig. 4. Candidate disparity selection for s-channel h_{r,s} using a bank of disparity channels.

channel width (pixels), fixed for each s-channel but varying with scale and orientation (Section IV-D). In (38), the s-channel phase difference relative to disparity is

(39)

and where the d-channel intensity is given by

(40)

This quantity is identical to a confidence measure used in [14] to define a weighted disparity measurement across scales; here, it is used to define the disparity channels. If the phase difference is small, then

(41)

which is maximized when the shifted amplitude modulation functions are equal. Care is needed when phase differences fall outside the principal range, e.g., by performing the arithmetic modulo 2π.

Thus, (38) functions as a phasor matching mechanism across disparity space. Matching between s-channel responses is triggered by a small phase difference at the disparity shift, and is subsequently graded by the similarity in the shifted s-channel magnitude responses. As shown in Fig. 3, the outputs

of the d-channel (38) are the disparity and the threshold response.

B. Computing Candidate Disparity from a Single s-Channel

Each s-channel has an associated population of d-channels Ch_{r,s}, each supplying a candidate disparity. From these, a single candidate disparity

(42)

is selected according to

(43)

This strategy yields the disparity giving the largest threshold response (38). Alternatively, one could form a disparity estimate weighted across channels, as in [14]. However, low-threshold channels have a greater likelihood of contributing errors from between-image distortions, hence a weighted strategy is more likely to contribute erroneous data. The process used here is depicted in Fig. 4, where the maximum channel intensity is used as the control signal to a multiplexer (Mux) which transfers the disparity output from the maximizing disparity channel.
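Sections IV-A and IV-B can be sketched together as follows: for each candidate disparity, a channel fires when the wrapped phase difference is small and is graded by the product of shifted magnitudes, and the selection then keeps the firing channel with the largest response. The 1-D formulation, the integer disparity grid, and the phase threshold `eps` are assumptions of this sketch.

```python
import numpy as np

def d_channel_responses(gL, gR, disparities, eps=0.5):
    """Threshold responses in the spirit of (38), sketched in 1-D.
    gL, gR: complex left/right s-channel responses along a scanline.
    A channel at shift d fires when the wrapped phase difference falls
    below `eps` (an assumed threshold), graded by the product of the
    shifted magnitudes, after Sanger's confidence measure [14]."""
    n = len(gL)
    out = np.zeros((len(disparities), n))
    for i, d in enumerate(disparities):
        xs = np.arange(max(0, -d), min(n, n - d))
        dphi = np.angle(gL[xs] * np.conj(gR[xs + d]))  # wrapped to (-pi, pi]
        fire = np.abs(dphi) < eps
        out[i, xs] = fire * np.abs(gL[xs]) * np.abs(gR[xs + d])
    return out

def candidate_disparity(gL, gR, disparities, eps=0.5):
    """Selection as in (43): the disparity whose channel fires strongest."""
    resp = d_channel_responses(gL, gR, disparities, eps)
    return np.asarray(list(disparities))[np.argmax(resp, axis=0)]
```

For instance, when the right response is a pure 4-sample shift of the left, the interior of the candidate map returns 4, with no explicit correspondence search.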

C. Computing Candidate Disparity Across a Set of Oriented s-Channels

Although different s-channel filter tessellations of the frequency plane are possible, we choose to parse the plane by radial frequency and orientation, as depicted in Fig. 2. Filters are aligned along contours of equal radial frequency and of equal orientation. Separations between s-channels along the frequency/orientation axes are closely tied to the widths of the d-channels that should be used, as discussed later. This tessellation allows for a coarse-to-fine processing schedule that efficiently directs the computation of disparity to successively higher resolutions.



Fig. 5. Candidate disparity selection across all orientations θ_s at a constant scale F_r. The “disparity candidate selection” blocks are explained in Fig. 4.

At a fixed scale, the best possible disparity is selected from those computed at different orientations. Thus, the disparity responses from multiple s-channels sharing a fixed radial frequency F_r, but across all orientations, are compared. The simplest way of doing this is to take the disparity computed from the most active s-channel, i.e., the one having the largest s-channel magnitude response to the left or right image, or possibly to the product of the responses. For simplicity, the left magnitude response is used here.2 Hence, from (42), for fixed F_r:

(44)

where

(45)

In this way, disparities computed from local frequencies/phases that are out-of-band are rejected. The overall process is depicted in Fig. 5.
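This per-pixel selection across orientations amounts to an argmax; a minimal sketch, assuming the per-orientation disparity and left-magnitude maps are already stacked into arrays:

```python
import numpy as np

def select_across_orientations(disp_maps, left_mags):
    """Per-pixel selection in the spirit of (44)-(45): keep the disparity
    computed by the most active s-channel across orientations at a fixed
    radial frequency, judged by the left-image magnitude response.
    disp_maps, left_mags: arrays of shape (n_orientations, H, W)."""
    best = np.argmax(left_mags, axis=0)            # winning orientation map
    return np.take_along_axis(disp_maps, best[None], axis=0)[0]
```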

D. Relating d-Channels to s-Channels

The AM–FM model (1) is naturally multiscale, but it is not quantized. In fact, it is smooth across the continuum of scales: local image frequencies/phases may vary from “low” to “high” and smoothly over orientations. Passing an image through s-channels quantizes local frequencies into orientation and scale bins. This is convenient, since disparity computations can be segregated by local orientation and scale. Likewise, defining discrete d-channels makes it possible to localize the computations according to disparity scale. Since disparity has a direct spatial interpretation, it is natural to control the action of the d-channels by the s-channel responses.

2An effective way of verifying correspondences would then be to match the right image to the left, using the right image magnitude responses. Inconsistent correspondences, possibly arising from occlusion, can then be discarded.

Fig. 6. Distribution of disparity channels from coarse to fine.

1) Disparity Channel Sensitivity: A natural and essential strategy of a coarse-to-fine disparity computation that measures local phase across a range of frequency scales is that finer disparity measurements be made from higher frequency measurements. One reason for this is that s-channel phase responses fall within a 2π dynamic range, and so there is an intrinsic periodic ambiguity introduced by the wrapped phase. As indicated by (24), the ambiguity occurs at finer scales with increased local frequency.

Therefore, the d-channel widths should decrease with (be inversely proportional to) the center frequency of the associated s-channel filter. The question arises whether the width should fall off with the radial center frequency, or with its horizontal component. While both strategies



Fig. 7. Stereo pair of turf and disparity map.

Fig. 8. Stereo pair of tree trunk and disparity map.

Fig. 9. Stereo pair of baseball on newspaper and disparity map.

are interesting, stereoscopic disparity (phase) computation in a left-right camera system is primarily sensitive to horizontally oriented phase variations. The disparity channel sensitivity is therefore defined so that forcing it to be constant across all scales and orientations requires the d-channel widths to fall off inversely with the horizontal s-channel center frequencies:

(46)
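Under this rule, the d-channel widths can be tabulated directly. In the sketch below the sensitivity constant, the scale spacing, and the orientations are assumed values, with the width taken inversely proportional to the horizontal component of the center frequency.

```python
import numpy as np

# Constant disparity-channel sensitivity, as in (46): d-channel widths fall
# off inversely with the horizontal center frequency F_r*cos(theta_s) of
# the parent s-channel. C, the scales, and the orientations are assumed.
C = 0.5                                      # assumed sensitivity constant
radial = [0.25 / 2**r for r in range(4)]     # F_r, one octave apart
orients = [-np.pi / 6, 0.0, np.pi / 6]       # theta_s

widths = {(r, s): C / (f * np.cos(th))
          for r, f in enumerate(radial)
          for s, th in enumerate(orients)}
```

Two consequences are visible in the table: widths double with each octave toward coarser scales, and oblique channels, having a smaller horizontal center frequency, receive wider d-channels than the horizontal-frequency channel at the same scale.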

2) Channel Shutdown Mechanism: Another important consideration is that the s-channel phase responses be used only when the instantaneous frequencies fall in-band; otherwise, the signal-to-noise ratio (SNR) will suffer. This may be assured by setting the output of the s-channel to zero whenever it is not true that

(47)

where the distance is Euclidean and the bound is the distance to the s-channel half-peak. Thus, if at a coordinate there is no AM–FM component that falls within the passband of a particular s-channel, then that channel is excluded from disparity computation at that coordinate.



Fig. 10. Stereo pair of doorway and disparity map.

Fig. 11. Stereo pair of large rock and disparity map.

Fig. 12. Stereo pair of Pentagon and disparity map.

3) s-Channel Sampling Tessellation: In a coarse-to-fine processing framework that uses local s-channel phase responses, it is natural that the channels be arranged in a dyadic, wavelet-like tessellation, with doubling of linear bandwidth and channel centers at higher frequencies. However, we propose a twist: instead, at a fixed orientation the s-channel horizontal center frequencies are doubled from coarse to fine scales. Thus, for fixed orientation, the horizontal center frequencies relate as follows across scales:

(48)

where a one-octave bandwidth is assumed here. Therefore, the d-channel widths satisfy

(49)

Since the number of disparity channels is finite, a maximum disparity may be defined, making it possible to determine the number of d-channels associated with each s-channel. The maximum disparity can be set to the image width, or determined in a simple way by assuming a closest possible object distance to the camera baseline. In any case,



it follows that

(50)

since the channel has zero width (unit width in a discrete implementation).

Fig. 6 depicts a layered representation of the coarse-to-fine disparity structure that will be used in the disparity computation procedure described next.

E. Coarse-to-Fine Processing

The coarse-to-fine phase correspondence process proceeds from the largest (coarsest) scale to the finest, in a manner not unlike that proposed by Langley et al. [16]. At each scale, compute the disparity map according to

(51)

where the candidate disparity is computed across orientations at each iteration (scale) as in (44), and where the initial disparity map is assumed to be zero. Because of the channel shutdown mechanism, it is possible that a coordinate will not have an assigned disparity at some scale. In this case, one can simply carry over the value from the next coarser scale.
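The coarse-to-fine accumulation, with the shutdown fallback, can be sketched as below; NaN is used here (an implementation assumption) to mark coordinates left unassigned at a scale.

```python
import numpy as np

def coarse_to_fine(residuals):
    """Accumulate disparity from coarse to fine, in the spirit of (51).
    residuals: per-scale residual disparity maps, coarsest first; NaN
    (an implementation assumption) marks coordinates where the shutdown
    mechanism left no measurement, so the coarser value is carried over."""
    d = np.zeros_like(residuals[0])            # initial map assumed zero
    for res in residuals:
        d = d + np.where(np.isnan(res), 0.0, res)
    return d
```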

V. SIMULATION

We now demonstrate a phase-based stereo algorithm defined by the developments in Section IV and earlier. The Gabor filterbank used is depicted in Fig. 2: 12 s-channels distributed over four scales and along three orientations. The s-channel bandwidth is one octave at every scale. The s-channels were designed with an aspect ratio such that the channels have twice the spectral resolution along the horizontal (disparity) direction. In each example, no effort was made to correct or obscure meaningless results obtained at the extreme left and right edges of each disparity map, where there is no possible correspondence. In each display, disparity is coded as intensity, with nearer objects appearing brighter. The images used are 256 × 256 pixels.

The first example (Fig. 7) depicts a slightly mounded area of turf receding into the distance. The computed depth map successfully captures both the progression in depth as well as the rounding of the surface. Fig. 8 depicts a gnarled tree trunk next to a creek. The algorithm was able to capture most of the depth variation, including detail in the smaller root on the right. The disparity map obtained from a stereo pair of a baseball (Fig. 9) appears at first to suffer discrepancies near the ball’s contour. However, close inspection reveals that the eye is fooled at these same locations, since there are patterns that are difficult to disambiguate. The next example (Fig. 10) is a stereo pair of a doorway. The algorithm does an excellent job of capturing abrupt changes in disparity, and even succeeds in computing depth from the reflected background. Fig. 11 shows a large rock having considerable surface variation, most of which is successfully captured and represented in the computed disparity map. Finally, Fig. 12 shows the result of applying the algorithm to the well-known stereo pair of the Pentagon in Washington, DC. The result obtained is quite comparable to the results obtained by computationally expensive algorithms that rely on simulated annealing-type optimization to disambiguate.

VI. CONCLUSION

We applied a powerful AM–FM surface albedo model to analyze the projection of surface patterns viewed through a binocular camera system. This led to the development of a modulation-based stereo matching framework in which local image phase was used to compute stereo disparities. A coarse-to-fine matching strategy was utilized, based on a new concept of disparity channels that melds naturally with the intrinsically multiscale AM–FM modeling paradigm and Gabor filterbank decomposition. It is envisioned that the combined concepts of phase-based matching and AM–FM image modeling will lead to further improvements in the future. We expect that one valuable direction of research would be the development of more specific quantitative models of modulated albedo profiles on different surfaces.

REFERENCES

[1] J. D. Pettigrew, “Is there a single, most-efficient algorithm for stereopsis?,” in Vision: Coding and Efficiency, C. Blakemore, Ed. Cambridge, U.K.: Cambridge Univ. Press, 1990.

[2] A. K. Jain, Fundamentals of Digital Image Processing. Englewood Cliffs, NJ: Prentice-Hall, 1989.

[3] D. Marr and T. Poggio, “A computational theory of human stereo vision,” Proc. R. Soc. Lond. B, vol. 204, pp. 301–328, 1979.

[4] W. E. L. Grimson, “A computer implementation of a theory of human stereo vision,” Philos. Trans. R. Soc. Lond. B, vol. 292, pp. 217–253, 1981.

[5] ——, “Computational experiments with a feature based stereo algorithm,” IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-7, pp. 17–34, Jan. 1985.

[6] U. R. Dhond and J. K. Aggarwal, “Structure from stereo—A review,” IEEE Trans. Syst., Man, Cybern., vol. 19, pp. 1489–1510, Dec. 1989.

[7] N. H. Kim and A. C. Bovik, “A contour-based stereo matching algorithm using disparity continuity,” Pattern Recognit., vol. 21, pp. 505–514, 1988.

[8] J. R. Jordan and A. C. Bovik, “Using chromatic information in edge-based stereo correspondence,” Comput. Vis., Graph., Image Process.: Image Understand., vol. 54, pp. 98–188, July 1991.

[9] D. Terzopoulos, “Multilevel computational processes for visual surface reconstruction,” Comput. Vis., Graph., Image Process., vol. 24, pp. 52–96, Jan. 1983.

[10] J. Y. Jou and A. C. Bovik, “Improved initial approximation and intensity-guided discontinuity detection in visible-surface reconstruction,” Comput. Vis., Graph., Image Process., vol. 47, pp. 292–326, Aug. 1989.

[11] M. J. Hannah, “Computer matching of areas in stereo images,” Ph.D. dissertation, Stanford Univ., Stanford, CA, 1974.

[12] S. T. Barnard, “Stochastic stereo matching over scale,” Int. J. Comput. Vis., vol. 3, pp. 17–32, 1989.

[13] J. R. Jordan and A. C. Bovik, “Using chromatic information in dense stereo correspondence,” Pattern Recognit., vol. 25, pp. 367–383, Apr. 1992.

[14] T. Sanger, “Stereo disparity computation using Gabor filters,” Biol. Cybern., vol. 59, pp. 405–418, 1988.

[15] A. D. Jepson and M. R. M. Jenkin, “The fast computation of disparity from phase differences,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Diego, CA, 1989.

[16] K. Langley, T. J. Atherton, R. G. Wilson, and M. H. E. Larcombe, “Vertical and horizontal disparities from phase,” Image Vis. Comput., vol. 9, pp. 296–302, 1991.

[17] A. C. Bovik, N. Gopal, T. Emmoth, and A. Restrepo, “Localized measurement of emergent image frequencies by Gabor wavelets,” IEEE Trans. Inform. Theory, vol. 38, pp. 691–712, Mar. 1992.

[18] Y. Yang and R. Blake, “Spatial frequency tuning of human stereopsis,” Vis. Res., vol. 31, pp. 1177–1189, 1991.

[19] H. R. Wilson, R. Blake, and D. L. Halpern, “Coarse spatial scales constrain the range of binocular fusion on fine scales,” J. Opt. Soc. Amer. A, vol. 8, pp. 229–236, Jan. 1991.

[20] J. A. Moorer, “Signal processing aspects of computer music: A survey,” Proc. IEEE, vol. 65, pp. 1108–1137, Aug. 1977.

[21] P. Maragos, T. F. Quatieri, and J. F. Kaiser, “On amplitude and frequency demodulation using energy operators,” IEEE Trans. Signal Processing, vol. 41, pp. 1532–1550, Apr. 1993.

[22] A. C. Bovik, P. Maragos, and T. F. Quatieri, “AM–FM energy detection and separation in noise using multiband energy operators,” IEEE Trans. Signal Processing, vol. 41, pp. 3245–3265, Dec. 1993.

[23] P. Maragos, T. F. Quatieri, and J. F. Kaiser, “Energy separation in signal modulations with applications to speech analysis,” IEEE Trans. Signal Processing, vol. 41, pp. 3024–3051, Oct. 1993.

[24] M. Clark and A. C. Bovik, “Texture discrimination using a model of visual cortex,” in Proc. IEEE Int. Conf. Syst., Man, Cybern., Atlanta, GA, Oct. 14–17, 1986.

[25] A. C. Bovik, M. Clark, and W. Geisler, “Multichannel texture analysis using localized spatial filters,” IEEE Trans. Pattern Anal. Machine Intell., vol. 12, pp. 55–73, Jan. 1990.

[26] A. C. Bovik, “Analysis of multichannel narrowband filters for image texture segmentation,” IEEE Trans. Signal Processing, vol. 39, pp. 2025–2043, Sept. 1991.

[27] P. Maragos and A. C. Bovik, “Image amplitude and frequency demodulation using multidimensional energy separation,” J. Opt. Soc. Amer., vol. 12, pp. 1867–1876, Sept. 1995.

[28] B. J. Super and A. C. Bovik, “Shape from texture using local spectral moments,” IEEE Trans. Pattern Anal. Machine Intell., vol. 17, pp. 333–343, Apr. 1995.

[29] J. P. Havlicek, D. S. Harding, and A. C. Bovik, “The multicomponent AM–FM image representation,” IEEE Trans. Image Processing, vol. 6, pp. 1094–1100, June 1996.

[30] J. P. Havlicek, “AM–FM image models,” Ph.D. dissertation, Univ. Texas at Austin, 1996.

[31] J. P. Havlicek, D. S. Harding, and A. C. Bovik, “Discrete quasieigenfunction approximation for AM–FM image analysis,” in Proc. IEEE Int. Conf. Image Processing, Lausanne, Switzerland, Sept. 16–19, 1996.

[32] S. Qian and D. Chen, Joint Time-Frequency Analysis. Englewood Cliffs, NJ: Prentice-Hall, 1996.

[33] B. J. Super and A. C. Bovik, “Planar surface orientation from texture spatial frequencies,” Pattern Recognit., vol. 28, pp. 729–743, May 1995.

[34] W. N. Klarquist and A. C. Bovik, “FOVEA: A foveated vergent active stereo system for dynamic three-dimensional scene recovery,” IEEE Trans. Robot. Automat., vol. 14, pp. 755–770, Oct. 1998.

[35] B. K. P. Horn, Robot Vision. Cambridge, MA: MIT Press, 1986.

[36] T.-Y. Chen, “Stereo disparity from local image phase: New models for image modulation, coarse-to-fine processing, and disparity channels,” Ph.D. dissertation, Univ. Texas at Austin, May 1995.

[37] J. G. Daugman, “Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters,” J. Opt. Soc. Amer., vol. 2, pp. 1160–1169, 1985.

[38] I. Daubechies, “Orthonormal bases of compactly supported wavelets,” Commun. Pure Appl. Math., vol. 41, pp. 909–996, 1988.

[39] R. Wilson and M. Spann, “Finite prolate spheroidal sequences and their applications II: Image feature description and segmentation,” IEEE Trans. Pattern Anal. Machine Intell., vol. 10, pp. 193–203, Mar. 1988.

[40] J. E. Marsden and A. J. Tromba, Vector Calculus. San Francisco, CA: W. H. Freeman, 1981.

[41] R. L. De Valois and K. K. De Valois, Spatial Vision. Oxford, U.K.: Oxford Univ. Press, 1990.

[42] D. J. Fleet, A. D. Jepson, and M. R. M. Jenkin, “Phase-based disparity measurement,” Comput. Vis., Graph., Image Process.: Image Understand., vol. 53, pp. 198–210, 1991.

[43] D. J. Fleet and A. D. Jepson, “Stability of phase information,” IEEE Trans. Pattern Anal. Machine Intell., vol. 15, pp. 1253–1268, Dec. 1993.

[44] M. R. M. Jenkin and A. D. Jepson, “Recovering local surface structure through local phase difference measurements,” Comput. Vis., Graph., Image Process.: Image Understand., vol. 59, pp. 72–93, 1994.

[45] D. J. Fleet, H. Wagner, and D. J. Heeger, “Neural encoding of binocular disparity: Energy models, position shifts and phase shifts,” Vis. Res., vol. 36, pp. 1839–1957, 1996.

[46] H. Knutsson, C. F. Westin, and G. Granlund, “Local multiscale frequency and bandwidth estimation,” in Proc. IEEE Int. Conf. Image Processing, Austin, TX, Nov. 1994.

[47] T. Passot and A. C. Newell, “Toward a universal theory for natural patterns,” Physica D, vol. 74, pp. 301–352, 1994.

[48] C. Bowman, T. Passot, M. Assenheimer, and A. C. Newell, “A wavelet based algorithm for pattern analysis,” Physica D, 1998.

[49] J. D. Murray, Mathematical Biology. Berlin, Germany: Springer-Verlag, 1993.

[50] S. C. Zhu and D. Mumford, “Prior learning and Gibbs reaction-diffusion,” IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 1236–1250, Nov. 1997.

[51] G. H. Granlund and H. Knutsson, Signal Processing for Computer Vision. Boston, MA: Kluwer, 1995.

[52] J. P. Havlicek, J. W. Havlicek, and A. C. Bovik, “The analytic image,” in Proc. IEEE Int. Conf. Image Processing, Santa Barbara, CA, Oct. 26–29, 1997.

Tieh-Yuh (Terry) Chen (M’95) was born in Pen-Hu, Taiwan, R.O.C., on Feb. 17, 1958. He received the B.S. and M.S. degrees in electrical engineering from National Cheng-Kung University, Tainan, Taiwan, in June 1980 and June 1982, respectively. In 1995, he received the Ph.D. degree from the University of Texas at Austin, Austin, TX.

From 1982 to 1990, he was an Assistant Scientist in the Chung-Shan Institute of Science and Technology (CSIST), Taiwan. He is currently a Senior Research Engineer at CSIST.

Alan Conrad Bovik (S’80–M’81–SM’89–F’96) was born in Kirkwood, MO, on June 25, 1958. He received the B.S. degree in computer engineering in 1980, and the M.S. and Ph.D. degrees in electrical and computer engineering in 1982 and 1984, respectively, all from the University of Illinois, Urbana-Champaign.

He is currently the General Dynamics Endowed Fellow and Professor in the Department of Electrical and Computer Engineering, the Department of Computer Sciences, and the Biomedical Engineering Program at the University of Texas at Austin, where he is the Associate Director of the Center for Vision and Image Sciences. During the Spring of 1992, he held a visiting position in the Division of Applied Sciences, Harvard University, Cambridge, MA. His current research interests include digital video, image processing, computer vision, wavelets, and computational aspects of biological visual perception. He is a registered Professional Engineer in the State of Texas and is a frequent consultant to industry and academic institutions. He has published over 250 technical articles in these areas and holds two U.S. patents.

Dr. Bovik has been involved in numerous professional society activities, including, currently, the Board of Governors of the IEEE Signal Processing Society, Editor-in-Chief of the IEEE TRANSACTIONS ON IMAGE PROCESSING, and the Editorial Board of the PROCEEDINGS OF THE IEEE. He was the Founding General Chairman of the First IEEE International Conference on Image Processing, Austin, TX, November 1994. He is a recipient of the IEEE Signal Processing Society Meritorious Service Award (1998), a winner of the University of Texas Engineering Foundation Halliburton Award, and a two-time Honorable Mention winner of the International Pattern Recognition Society Award for Outstanding Contribution (1988 and 1993).

Lawrence K. Cormack was born in Socorro, NM. He received the undergraduate degree (with high honors) from the University of Florida, Gainesville, in 1986, and the Ph.D. degree in physiological optics in 1992 from the University of California, Berkeley.

He became an Assistant Professor of psychology at the University of Texas at Austin in 1992, where he is currently an Associate Professor.