
Copyright © 2010 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail [email protected]. VRCAI 2010, Seoul, South Korea, December 12 – 13, 2010. © 2010 ACM 978-1-4503-0459-7/10/0012 $10.00

An Efficient Algorithm for Depth Image Rendering

Cesar Palomo∗

Computer Science Department, PUC-Rio

Marcelo Gattass†

Computer Science Department, PUC-Rio

(a) Left view (b) Right view (c) Composite view

Figure 1: A pair of reference images is warped to a new viewpoint and composited in real time.

Abstract

As depth sensing devices become popular and affordable, end-users can have the freedom to choose a different point of view in a multi-camera transmission. In this paper, we propose an image-based rendering (IBR) algorithm to generate perceptually accurate virtual views of a scene in real time. The proposed algorithm is implemented, and tests with publicly available datasets are presented. From these results one can evaluate the efficiency and quality of the algorithm.

CR Categories: I.3.3 [Computer Graphics]: Picture/Image Generation—Viewing algorithms; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Virtual reality; I.4.9 [Image Processing and Computer Vision]: Applications

Keywords: Image-based rendering, GPU programming, depth images

1 Introduction

The generation of novel views from acquired imagery is motivated by several applications in computer games, sports broadcasting, TV advertising, cinema and the entertainment industry. In case of ambiguity in a soccer game, for instance, many input views may be used to synthesize a new view at a different angle to help referees inspect for events such as fouls or offsides.

∗e-mail: [email protected]    †e-mail: [email protected]

With the recent development and promising popularity of depth sensing devices, such as depth range sensors and time-of-flight cameras, practical techniques for segmentation, compression, filtering and rendering of depth images arise as interesting and useful research topics.

This work proposes an algorithm for rendering synthetic novel views of a scene captured by real cameras, capable of generating visually accurate images at interactive rendering rates, using a relatively small number of input views. Calibration data and depth images, i.e., color images along with their dense depth maps, are used as the sole input of our algorithm.

We aim to provide an IBR algorithm that runs entirely on the GPU, not only to guarantee good performance, but also to leave the CPU free to perform other tasks, e.g., user interface tasks. An important property of the proposed algorithm is that it allows for automatic processing of the imagery, without need for pre- or post-processing stages.

The rest of this document is organized as follows. Section 2 presents a review of related research in IBR. Section 3 describes the depth image representation and how images can be composited for virtual synthesis. Section 4 details all the steps involved in the proposed method and their implementation on the GPU. Test results are shown in Section 5, and concluding remarks and future work directions are presented in Section 6.

2 Related Work

After the pioneering work on IBR in the mid-1990s [Levoy and Hanrahan 1996][McMillan and Bishop 1995][Gortler et al. 1996], several systems have been proposed to extract models from images and use them as geometric proxies during rendering. This made it possible to considerably reduce the density of input images required for high-quality rendering. A good review of dense two-frame stereo, a commonly used technique to obtain models from input images, can be found in [Scharstein et al. 2002].


[Pulli et al. 1997] represent the scene as a collection of textured depth maps and blend the three closest depth maps to generate an image from a novel viewpoint. They also introduce a soft z-buffer to deal with occlusions and blend reference cameras' pixels based on their proximity to the new viewpoint. [Debevec et al. 1998] propose an efficient method for view-dependent texture mapping, allowing for real-time implementations using built-in graphics hardware capabilities. [Buehler et al. 2001] present a principled study of the compositing stage.

High quality rendering results have been reported by [Zitnick et al. 2004], using a modest number of input reference cameras. Inspired by Layered Depth Images [Shade et al. 1998], they augment the depth maps with a second layer that carries information only at locations near depth discontinuities. This layer contains matting information to improve the rendering quality at object borders. The identification of depth discontinuities and the matting calculation are computed in an offline pre-processing stage. They also build a separate mesh on the CPU to deal with discontinuities in depth.

This paper proposes an algorithm based on the ideas presented in [Zitnick et al. 2004]. The devised method does not require pre- or post-processing stages and can be fully implemented in the graphics hardware, avoiding costly CPU-GPU transfers.

3 Depth Image Rendering

To establish notation, in this section we review basic concepts of depth image rendering.

3.1 Pinhole Camera Model

The image acquisition process can be represented by a pinhole camera model [Hartley and Zisserman 2000], in which the imaging process is a sequence of transforms between different coordinate systems. Each input camera has associated calibration data: a 4×4 view matrix V that determines the camera's position and pose, and a 3×4 calibration matrix K with the intrinsic properties used for perspective projection: field of view, aspect ratio, skew factor and image optical center [Hartley and Zisserman 2000].

Using the pinhole camera model, a point p(x, y, z) written originally in a global coordinate system can be converted to the camera's coordinate system and then finally to the image coordinate system pi(u, v) through the sequence of transforms defined in Equation 1 (using homogeneous coordinates):

pi(uw, vw, w)^T = K V p(x, y, z, 1)^T.    (1)
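For concreteness, the following is a minimal NumPy sketch of Equation 1, assuming K is given as a 3×4 calibration matrix and V as a 4×4 view matrix; the function and variable names are ours, not part of any reference implementation.

    import numpy as np

    def project(K, V, p_world):
        # Equation 1: world point -> image coordinates.
        # K: 3x4 calibration matrix, V: 4x4 view matrix, p_world: (x, y, z).
        p_h = np.append(np.asarray(p_world, dtype=float), 1.0)  # homogeneous point
        uvw = K @ V @ p_h               # (u*w, v*w, w)
        return uvw[:2] / uvw[2]         # perspective division -> (u, v)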

3.2 Depth Map Representation

The representation composed of a color image and a dense depth map, with one associated depth value for each pixel in the image, is referred to throughout this paper as a depth image. An example is shown in Figure 2.

Depth stored in a depth map can be relative to any selected coordinate system. Let us assume that depth in a collection of depth maps is written in a common global coordinate system. This means that each pixel pi(u, v) in a depth map stores an associated z value. Equation 1 can then be inverted to retrieve the pixel's global coordinate p(x, y, z).

One detail to notice is that depth maps are usually gray-level images, commonly in 8-bit format. For that reason, a linearization of the actual depth z into a depth level d is usually applied before storage. Equation 2 shows a possible linearization of d in the range [0, 255]; zmin and zmax represent, respectively, the minimum and maximum values of z in each depth map.

d = 255 · (1/zmax − 1/z) / (1/zmax − 1/zmin).    (2)

The inverse of Equation 2 can be used to recover z from the depth level d fetched from the depth map at pixel pi(u, v).
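As a sketch, the conversion between metric depth and the stored depth level can be written as follows, assuming zmin and zmax are known for the depth map; the function names are illustrative.

    import numpy as np

    def z_to_level(z, z_min, z_max):
        # Equation 2: metric depth z -> 8-bit depth level d in [0, 255].
        return 255.0 * (1.0 / z_max - 1.0 / z) / (1.0 / z_max - 1.0 / z_min)

    def level_to_z(d, z_min, z_max):
        # Inverse of Equation 2: depth level d -> metric depth z.
        inv_z = 1.0 / z_max - (d / 255.0) * (1.0 / z_max - 1.0 / z_min)
        return 1.0 / inv_z

For example, z_to_level(z_min, z_min, z_max) evaluates to 255 and z_to_level(z_max, z_min, z_max) to 0, consistent with darker pixels meaning greater depth in Figure 2.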

3.3 Forward Mapping

View-dependent texture mapping [Debevec 1996][Debevec et al. 1998] has been proposed to render a virtual image by compositing multiple views of a given scene. It is a forward mapping method: a textured 3D mesh is unprojected from a reference camera's coordinate system to a global coordinate system, and finally warped to the virtual viewpoint. In essence, the inverse of Equation 1 is used to unproject a pixel pi(u, v) with its depth z, and the virtual camera's projection matrix is then applied to warp it to the novel view.
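The sketch below illustrates this unproject-and-reproject step on the CPU with NumPy, assuming the 3×3 intrinsic part K3 of the reference camera and a camera-space depth are available; it is meant only to make the math explicit, not to mirror the GPU implementation.

    import numpy as np

    def warp_pixel(u, v, z_cam, K3, V_ref, KV_virt):
        # Unproject pixel (u, v) with camera-space depth z_cam from the
        # reference camera, then reproject into the virtual camera.
        # K3: 3x3 intrinsics of the reference camera, V_ref: its 4x4 view
        # matrix, KV_virt: 3x4 combined matrix (K V) of the virtual camera.
        p_cam = z_cam * (np.linalg.inv(K3) @ np.array([u, v, 1.0]))   # camera coords
        p_world = np.linalg.inv(V_ref) @ np.append(p_cam, 1.0)        # world coords
        uvw = KV_virt @ p_world                                       # warp to new view
        return uvw[:2] / uvw[2]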

3.3.1 Occlusions

Although straightforward to implement in graphics hardware, the 3D warping technique fails to produce good results at occluded regions, which are not sampled in a given reference viewpoint but can be revealed as the virtual viewpoint moves. Although additional reference cameras can be used to fill in the missing information, occlusions still need to be identified and handled to avoid the rubber-sheets problem [Mark et al. 1997].

3.3.2 Compositing

Soft z-buffering and weighting are generally used to compose multiple views [Buehler et al. 2001]. The soft z-buffer uses a tolerance for the z-test among warped pixels. Pixels with similar z-values get composited through weighting based on proximity to the virtual viewpoint. Otherwise, only the closest pixel contributes to the final color in the virtual image.

4 GPU Algorithm

The steps of the proposed method are depicted in Figure 3. The following subsections present each of the steps and give details on how they can be implemented in graphics hardware.

4.1 Mesh Creation

The first step of the method is to create a 3D mesh for warping. A set of W × H vertices, corresponding to the resolution of the input images, is stored in a vertex buffer. That data structure is created once and reused at every render cycle, optimizing for speed. If all input images from different reference cameras have the same resolution, only a single buffer needs to be created.

The x and y components of the vertex buffer are the actual pixel locations in the depth map, i.e., x ∈ (0..W−1), y ∈ (0..H−1). This defines a regular grid where the z coordinate will be calculated on the GPU, before warping.
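A minimal sketch of this setup step, assuming vertices are stored as (x, y, z) triples with z initialized to zero since it is computed later on the GPU:

    import numpy as np

    def make_grid_vertices(width, height):
        # Regular W x H grid used as the warping mesh; built once and reused
        # every render cycle. z stays 0 here and is filled in by the shader.
        ys, xs = np.mgrid[0:height, 0:width]
        verts = np.zeros((width * height, 3), dtype=np.float32)
        verts[:, 0] = xs.ravel()   # pixel column, x in 0..W-1
        verts[:, 1] = ys.ravel()   # pixel row,    y in 0..H-1
        return verts               # uploaded to a vertex buffer once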

4.2 Render Reference Cameras

Similar to [Zitnick et al. 2004], the pair of reference cameras closest to the position of the new viewpoint is chosen and used for rendering. The use of only two reference cameras may cause some visible artifacts in regions not sampled by either camera, but that problem should not happen when the input cameras' baseline is not too wide.

Figure 2: Example of a depth image: color image + dense depth map. Darker pixels in the depth map mean greater depth. Courtesy of Zitnick et al.

Figure 3: After an initial 3D mesh creation in the CPU, the 3D warping, occlusion identification, soft Z-buffering and blending stages run entirely on graphics hardware. The other steps represent inexpensive render calls by the CPU.

The pair of reference cameras is rendered one at a time: the input camera's calibration and viewing matrices are pre-multiplied and sent to the GPU as a single matrix, and the vertex buffer created during setup is rendered. Both the color image and the depth map are sent to the GPU as textures for use during warping.
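The per-frame CPU work amounts to a few small operations, sketched below under the assumption that reference cameras are ranked by the Euclidean distance between their centers and the virtual viewpoint (the selection criterion is not prescribed by the text); all names are illustrative.

    import numpy as np

    def choose_reference_pair(virtual_center, ref_centers):
        # Pick the two reference cameras closest to the virtual viewpoint
        # (simple Euclidean distance between camera centers, assumed here).
        d = np.linalg.norm(np.asarray(ref_centers) - virtual_center, axis=1)
        return np.argsort(d)[:2]

    def combined_camera_matrix(K, V):
        # Pre-multiply the 3x4 calibration and 4x4 view matrices so that a
        # single matrix per reference camera is sent to the GPU.
        return K @ V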

4.3 3D Warping and Occlusion Identification

These steps are implemented in a programmable vertex shader. They are performed once for each reference camera.

For a vertex (u, v), the depth level d is fetched at the corresponding position in the depth map and converted to the actual z value. Then the world coordinates p(x, y, z) for that vertex are calculated. That point is finally warped to the new viewpoint using the current view and projection matrices for the new vantage point.

In the algorithm presented here, the identification of occlusions is done using the gradient of the depth z(u, v) in the depth map, defined in Equation 3. The vertex shader makes three texture accesses to retrieve the depth levels and then converts them to actual depths. We define a threshold τG so that when ||∇z(u, v)|| > τG, the vertex is considered occluded and its alpha channel is set to 0. Otherwise, its alpha is set to 1.

∇z(u, v) = ( z(u+1, v) − z(u, v),  z(u, v+1) − z(u, v) ).    (3)

The output of the vertex shader is the occlusion label stored in the alpha channel, the vertex position, and the texture coordinates for the color texture, which will be interpolated by the hardware rasterizer. Finally, a fragment shader performs the perspective division and calculates the pixel's depth and color (along with the alpha channel for occlusion handling). The results of this step are stored in a framebuffer object for further compositing.
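Although these steps run in a vertex shader, the occlusion test of Equation 3 can be prototyped on the CPU; the sketch below, in NumPy, uses forward differences over the whole depth map and a threshold tau_g (the paper's τG), with all other names being ours.

    import numpy as np

    def occlusion_alpha(z_map, tau_g):
        # z_map: H x W array of metric depths (rows index v, columns index u).
        # Returns an H x W alpha map: 0 where the depth gradient magnitude
        # exceeds tau_g (vertex considered occluded), 1 elsewhere.
        dz_du = np.zeros_like(z_map)
        dz_dv = np.zeros_like(z_map)
        dz_du[:, :-1] = z_map[:, 1:] - z_map[:, :-1]   # z(u+1, v) - z(u, v)
        dz_dv[:-1, :] = z_map[1:, :] - z_map[:-1, :]   # z(u, v+1) - z(u, v)
        grad_mag = np.hypot(dz_du, dz_dv)
        return np.where(grad_mag > tau_g, 0.0, 1.0)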

4.4 Compositing

After both reference cameras' results have been rendered to textures with color and depth information, a quad is drawn to the screen to activate the compositing step, which is done entirely by a programmable fragment shader.

4.5 Soft Z-Buffering and Blending

The basic functioning of the soft Z-buffer is as follows. Depths from both reference cameras are fetched from the corresponding depth textures. When they differ by more than a threshold τz, only the closest pixel contributes to the final color of the rendered image, at full opacity. Otherwise, blending is performed.

The blending contribution weights are based on angular distances [Porquet et al. 2005], as shown in Figure 4. We unproject the pixel to the world coordinate system and define line segments linking that point to the reference cameras' centers. Angular distances θi and θi+1 are used to measure the influence of views i and i+1. For smooth color interpolation, [Buehler et al. 2001] suggest a cosine-based weight. To avoid two cosine calculations, we calculate only the weight wi for view i, with wi ∈ [0, 1], using Equation 4, and set the other reference camera's weight as wi+1 = 1 − wi.

wi = 0.5 · (1 + cos(π · θi / (θi + θi+1))).    (4)

We incorporate into the soft Z-buffer algorithm the occlusion information α stored in the alpha channel of the color image's pixel. The final pixel color can be calculated with:

color = (αi wi colori + αi+1 wi+1 colori+1) / (αi wi + αi+1 wi+1).    (5)

When the denominator in Equation 5 is zero (i.e., when the pixels from both cameras are marked occluded), we arbitrarily use one of the cameras' pixel colors to fill the occlusion.
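The per-pixel compositing logic of Equations 4 and 5, including the soft Z-test and the all-occluded fallback, can be summarized by the following sketch; it stands in for the fragment shader, and all parameter names are ours.

    import numpy as np

    def composite_pixel(z0, z1, c0, c1, a0, a1, theta0, theta1, tau_z):
        # z*, c*, a*: warped depth, RGB color and occlusion alpha from the two
        # reference cameras; theta*: angular distances to the virtual view.
        if abs(z0 - z1) > tau_z:
            # Soft Z-test failed: only the closest pixel contributes.
            return np.asarray(c0 if z0 < z1 else c1, dtype=float)
        # Equation 4: cosine-based weight for camera 0; camera 1 gets 1 - w0.
        w0 = 0.5 * (1.0 + np.cos(np.pi * theta0 / (theta0 + theta1)))
        w1 = 1.0 - w0
        denom = a0 * w0 + a1 * w1
        if denom == 0.0:
            # Both pixels marked occluded: arbitrarily keep camera 0's color.
            return np.asarray(c0, dtype=float)
        # Equation 5: alpha- and angle-weighted blend.
        return (a0 * w0 * np.asarray(c0) + a1 * w1 * np.asarray(c1)) / denom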

Figure 5 illustrates some steps of the proposed framework.


Figure 4: Visibility, angular distances and occlusions are used to determine final color.

Figure 5: Results for a reference camera and final result. The first image shows rubber sheets at object borders, which are identified using the magnitude of the depth gradient and encoded in the alpha channel, shown in the second image. Those regions appear erased in the third image. The last image shows the final result after compositing the pair of reference images.

5 Experimental Results

To verify the effectiveness of the proposed method, we used the sequences Ballet and Breakdancers generated and provided by the Interactive Visual Media Group at Microsoft Research. Both sequences consist of 100 color images each of real dynamic scenes, taken from 8 different stationary vantage points, along with high-quality depth maps [Zitnick et al. 2004]. All depth images have a resolution of 1024 × 768 pixels, captured at 15 FPS.

The proposed algorithm was implemented and tested on a workstation equipped with an Intel Core 2 Quad 2.4 GHz, 2 GB of RAM and an NVidia GeForce 9800 GX2 GPU with 512 MB of memory. Although a more systematic test should be performed to attest to the interactive performance, the observed rendering rates (above 100 FPS for the tested images) suggest that the proposed system could be applied to higher resolution images while keeping real-time performance.

The visual accuracy metric used in the experiments was the peak signal-to-noise ratio (PSNR). PSNR is commonly used to measure the human perception of reconstruction quality, working as an evaluation function based on the squared error between two images, calculated with Equation 6.

The test methodology consisted of using a reference camera's color image I as basis and recreating the color image Iv from that vantage point. Iv, with resolution W × H, could then be compared against I using the PSNR with the pixelwise mean squared error (luminance values from 0 to 255 were used), as shown in Equation 7. A 10-pixel wide border of the resulting image was cropped prior to the PSNR calculation, since most algorithms do not deal well with image borders.

MSE = (1 / (W·H)) · Σ_{i=0}^{W−1} Σ_{j=0}^{H−1} [I(i, j) − Iv(i, j)]².    (6)

PSNR = 10 · log10(255² / MSE).    (7)
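For reference, a compact implementation of this evaluation, assuming 8-bit luminance images stored as NumPy arrays and the 10-pixel border crop described above:

    import numpy as np

    def psnr(reference, rendered, border=10):
        # Equations 6 and 7: MSE and PSNR between reference image I and the
        # re-rendered image Iv, after cropping a border of the given width.
        I = np.asarray(reference, dtype=np.float64)[border:-border, border:-border]
        Iv = np.asarray(rendered, dtype=np.float64)[border:-border, border:-border]
        mse = np.mean((I - Iv) ** 2)                # Equation 6
        return 10.0 * np.log10(255.0 ** 2 / mse)    # Equation 7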

We ran tests on each of the 100 frames of both depth image sequences, setting camera 4 as the virtual viewpoint and using reference cameras 3 and 5 during rendering (refer to [Zitnick et al. 2004] for more information on their camera setup). Figure 6 depicts the PSNR results for the color images in the Ballet sequence, while Figure 7 shows the results for the Breakdancers sequence. In both cases, compared to an implementation using 3D warping followed by blending based on angular distances, the proposed algorithm considerably increases the PSNR, while maintaining highly interactive rendering rates.

6 Conclusion and Future Work

In this paper, we proposed an integrated algorithm to render virtual viewpoints using depth images and calibration data, which makes massive use of graphics hardware to maintain high interactive rates.

The perceptual visual quality of images rendered using the proposed method is comparable to state-of-the-art work [Zitnick et al. 2004]. In our evaluation, the remaining visible artifacts could be significantly diminished through color balancing of the input images. The algorithm could also benefit from matting information, but since current matting methods run offline, they were not added to our framework.

The presented algorithm can be seen as a good tradeoff between visual quality and speed, and can be easily implemented for free virtual viewpoint control by the user in live transmissions with fairly good results, enhancing the viewer experience. The proposed pipeline also allows for easy extension and improvement.


Figure 6: Results for tests with all 100 frames of Breakdancers sequence.

Figure 7: Results for tests with all 100 frames of Ballet sequence.


References

BUEHLER, C., BOSSE, M., MCMILLAN, L., GORTLER, S., AND COHEN, M. 2001. Unstructured lumigraph rendering. In Computer Graphics, SIGGRAPH 2001 Proceedings, 425–432.

DEBEVEC, P., YU, Y., AND BOSHOKOV, G. 1998. Efficient view-dependent image-based rendering with projective texture-mapping. Tech. Rep. UCB/CSD-98-1003, EECS Department, University of California, Berkeley.

DEBEVEC, P. E. 1996. Modeling and Rendering Architecture from Photographs. PhD thesis, University of California at Berkeley, Computer Science Division, Berkeley, CA.

GORTLER, S. J., GRZESZCZUK, R., SZELISKI, R., AND COHEN, M. F. 1996. The lumigraph. In SIGGRAPH '96: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, ACM, New York, NY, USA, 43–54.

HARTLEY, R. I., AND ZISSERMAN, A. 2000. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521623049.

LEVOY, M., AND HANRAHAN, P. 1996. Light field rendering. In SIGGRAPH '96: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, ACM, New York, NY, USA, 31–42.

MARK, W. R., MCMILLAN, L., AND BISHOP, G. 1997. Post-rendering 3D warping. In 1997 Symposium on Interactive 3D Graphics, 7–16.

MCMILLAN, L., AND BISHOP, G. 1995. Plenoptic modeling: an image-based rendering system. In SIGGRAPH '95: Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, ACM, New York, NY, USA, 39–46.

PORQUET, D., DISCHLER, J.-M., AND GHAZANFARPOUR, D. 2005. Real-time high-quality view-dependent texture mapping using per-pixel visibility. In GRAPHITE '05: Proceedings of the 3rd international conference on Computer graphics and interactive techniques in Australasia and South East Asia, ACM, New York, NY, USA, 213–220.

PULLI, K., COHEN, M., DUCHAMP, T., HOPPE, H., SHAPIRO, L., AND STUETZLE, W. 1997. View-based rendering: Visualizing real objects from scanned range and color data. In Eurographics Rendering Workshop, 23–34.

SCHARSTEIN, D., SZELISKI, R., AND ZABIH, R. 2002. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision 47, 7–42.

SHADE, J., GORTLER, S., HE, L.-W., AND SZELISKI, R. 1998. Layered depth images. In SIGGRAPH '98: Proceedings of the 25th annual conference on Computer graphics and interactive techniques, ACM, New York, NY, USA, 231–242.

ZITNICK, L. C., KANG, S. B., UYTTENDAELE, M., WINDER, S., AND SZELISKI, R. 2004. High-quality video view interpolation using a layered representation. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, ACM, New York, NY, USA, 600–608.
