
Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC) 2004

Real-time color blending of rendered and captured video

Erik Reinhard, Ahmet Oguz Akyuz, Mark Colbert, Charles E. Hughes

School of Computer Science, University of Central Florida

Orlando FL, [email protected]

Matthew O’Connor

Institute for Simulation and Training, University of Central Florida

Orlando FL

ABSTRACT

Augmented reality involves mixing captured video with rendered elements in real-time. For augmented reality to be effective in training and simulation applications, the computer generated components need to blend in well with the captured video. Straightforward compositing is not sufficient, since the chromatic content of video and rendered data may be so different that it is immediately obvious which parts of the composited image were rendered and which were captured.

We propose a simple and effective method to color-correct the computer generated imagery. The method relies on the computation of simple statistics such as mean and variance, but does so in an appropriately chosen color space, which is key to the effectiveness of our approach. By shifting and scaling the pixel data in the rendered stream to take on the mean and variance of the captured video stream, the rendered elements blend in very well.

Our implementation currently reads, color-corrects and composites video and rendered streams at a rate of more than 22 frames per second for a 720x480 pixel format. Without color correction, our implementation generates around 30 frames per second, indicating that our approach comes at a reasonably small computational cost.

ABOUT THE AUTHORS

Erik Reinhard is assistant professor at the University of Central Florida and has an interest in the fields of visual perception and parallel graphics. He has a B.Sc. and a TWAIO diploma in computer science from Delft University of Technology and a Ph.D. in computer science from the University of Bristol. He was a post-doctoral researcher at the University of Utah. He is founder and co-editor-in-chief of ACM Transactions on Applied Perception and is currently writing a book on High Dynamic Range Imaging.

Ahmet Oguz Akyuz is a Ph.D. student at the University of Central Florida. He studied computer engineering and in 2003 he received his B.Sc. from the Middle East Technical University in Ankara, Turkey. His current interests are in computer graphics, color appearance modeling and high dynamic range imaging.

Mark Colbert is currently a senior undergraduate student at the University of Central Florida majoring in Computer Science. He currently works as an undergraduate research assistant in UCF’s Graphics Laboratory. In addition to his academic studies, Mark has worked in the computer animation industry for over 3 years, designing graphics for clientele such as the Tampa Bay Buccaneers and the Anaheim Angels.

Matthew O’Connor is the Programming Team Leader for the Media Convergence Laboratory at the Institute for Simulation and Training. He received his B.S. in Computer Science from the University of Central Florida in 2002, and is currently pursuing his Ph.D. in Computer Science. His research interests include mixed reality, interacting agents, and distributed computing.


Charles E. Hughes is Professor of Computer Science in the School of Computer Science at the University of Central Florida. After receiving his Ph.D. in Computer Science in 1970, he served on the faculties of Computer Science at Penn State University and the University of Tennessee prior to joining UCF in 1980. His research interests are in mixed reality, distributed interactive simulation and models of distributed computation. He has authored or co-authored over 100 refereed research papers and six books. He is a senior member of the IEEE, and a member of the ACM and the IEEE Computer Society.


INTRODUCTION

Modern simulation and training applications frequently employ a mixture of computer generated content and live video. Thus multiple streams of data need to be composited in real-time before the result may be presented to the viewer. This is easily achieved with the alpha blending operations available on current graphics hardware. For each frame, parts of the captured video are replaced by rendered data. This approach works well, but ignores several issues which may hamper the effectiveness of training and simulation applications.

In particular, the rendered data may have a visually different appearance from the captured video. Consider a computer generated vehicle being superimposed over filmed terrain. This vehicle will only blend in well with the background if the modeling of the vehicle is carefully done, the colors of the vehicle are carefully chosen, and the lighting conditions under which the vehicle is rendered match the lighting conditions of the terrain. The latter requirement is particularly difficult to maintain.

We present a fast and automatic image analysis and correction technique which allows the rendered content to be post-processed such that it blends in well with the captured video data. The work is directly applicable to simulation and training applications as outlined above, as well as several other fields.

For instance, in the film industry a common task performed by colorists is ensuring that scenes that follow one another appear without unnatural transitions. Since scenes may be shot with different cameras, settings and lighting, this frequently involves chromatic adjustment of film. During post-processing, colorists also perform a creative task, namely instilling a particular atmosphere or dramatic effect by adjusting film color.

Similar tasks occur in digital matting applications where separate video streams are composited to form a new scene, with each frame containing pixels from multiple sources. Here, usually a foreground is filmed against a blue screen, and the blue is later replaced with a separately filmed background. The filming of foreground and background are usually separated both in time and location.

In augmented reality, video from different sources is also composited. However, here one of the sources is usually computer generated in real time, and live video is the other source. To ensure that the computer generated video blends in well with the live video, it would normally be necessary to carefully match the light sources found in the real scene with the light sources used for rendering the computer generated video stream. In addition, the appearance of materials and geometry in the rendered scene, including the direction of shadows, should be matched for a believable blend of real-time rendered data and live video.

Matching high level features such as the position of light sources and other geometry is a complex task which requires considerable computation for the analysis of the live video. While such an analysis and subsequent matching of 3D geometry with live video is a worthwhile goal which may considerably enhance the visual experience of the presented material, we address the arguably simpler problem of matching the chromatic content in a hands-off manner. Our solution therefore has applications in augmented reality, as well as more general video matting applications, and could be extended to automatically match subsequent scenes in film.

Our work extends a previous idea which allows the colors of a single photograph to be semi-automatically matched with the colors of another photograph (Reinhard et al., 2001). We introduce two set-and-forget user parameters which allow control over the amount of blending. Whenever the environment which is filmed changes considerably, or if the rendered data changes, these parameters may need to be readjusted. However, under normal circumstances the method is robust and parameter adjustment is rarely necessary.

By following the keep-it-simple principle, the number of computations our algorithm requires per frame is kept low. Further speed-up is afforded by the fact that the analysis of each frame does not necessarily involve all pixels in both video and rendered streams. The chromatic adjustment may be merged with the compositing step to minimize unnecessary memory accesses. This leads to a simple, effective and highly efficient algorithm which runs at interactive rates.

PREVIOUS WORK

While manipulation of color content of images takes many different forms, we are interested in changing an image by means of analysis of a second image. To date only three such algorithms are known. The first is called Image Analogies and allows many different characteristics of an image to be imposed upon a second image, including such attributes as color and brush-strokes (Hertzmann et al., 2001). This approach is powerful, but perhaps too computationally expensive for real-time operation.

The second algorithm is statistical in nature and is called Color Transfer (Reinhard et al., 2001). Here, a decorrelated color space is used to analyze both example and target images, the statistical measures being mean and variance. The pixels in the target image are shifted and scaled so that this image has the same mean and variance as the example image. This method is straightforward to implement and computationally efficient. It was later extended to allow colorization of grey-scale images (Welsh et al., 2002).

A weakness of this method is that the composition of example and target images needs to be similar to obtain plausible results. An ad-hoc solution is to allow the user to specify pairs of swatches which will then guide the color transfer process. While this extends the use of the method, manual intervention is not desirable for real-time applications.

The third algorithm improves the color transfer method considerably by considering color names (Chang et al., 2004). Colors can be classified according to name, which is remarkably consistent across languages and cultures. Rather than employing swatches, each pixel in both example and target images is categorized into one of eleven color groups. The color adjustment is then applied such that colors stay within their category. This produces very plausible results, albeit at the cost of extra computations. If computation time is not at a premium, we would advocate the use of this color naming scheme.

However, in this paper we are interested in real-time color transfer to allow computer generated data to be blended with video. We therefore opt to use the original color transfer method. We apply this algorithm in essentially its basic form, but provide simple extensions to overcome the need for using manually selected pairs of swatches. We also present experiments which show how many pixels per frame are required to compute statistics, thus allowing performance to be optimized without introducing artifacts such as flicker between frames. Finally, we employ a slightly different color space which further reduces the number of computations.

COLOR TRANSFER

In RGB color space, the values of one channel are almost completely correlated with the values in the other two channels. This means that if a large value for the red channel is found, the probability of also finding large values in the green and blue channels is high. Color transfer between images in RGB color space would therefore constitute a complex 3-dimensional problem with no obvious solution.

To understand the color opponent space employed by the human visual system, Ruderman et al. converted a set of spectral images of natural scenes to LMS cone space and applied Principal Components Analysis (PCA) to this ensemble (Ruderman et al., 1998). The LMS color space roughly corresponds to the peak sensitivities of the three cone types found in the human retina. The letters stand for Long, Medium and Short wavelengths and may be thought of as red, green and blue color axes.

The effect of applying PCA to an ensemble of images is that the axes are rotated such that the first channel represents luminance, and the other channels carry red-green and yellow-blue color opponent information. Color opponent channels have a different interpretation than channels in more familiar color spaces. For example, a value specified for red in RGB determines how much red is required to mix a given color. The values usually range from 0, i.e. no red, to 255, indicating maximum red.

In a color opponent channel, values can be both positive and negative. For instance, in the yellow-blue color opponent channel a large positive value indicates 'very yellow', whereas a large negative value indicates 'very blue'. Values around zero indicate the absence of yellow and blue, i.e. a neutral grey. The same is true for the red-green channel. A consequence of this representation is that it is impossible to represent a color that is blue with a tinge of yellow. This is in agreement with human visual experience: humans are incapable of seeing yellowish blue or greenish red. On the other hand, color opponent spaces predict that we should be able to see bluish red (purple) or greenish yellow.

Since the PCA algorithm decorrelates the data through rotation, it is evident that one of the first steps of the human visual system is to decorrelate the data. Ruderman et al. found that while theoretically the data is only decorrelated, in practice the data becomes nearly independent. In addition, by taking the logarithm the data becomes symmetric and well-conditioned. The resulting color opponent space is termed Lαβ.

The implications for color transfer are that our complex 3-dimensional problem may be broken into three 1-dimensional problems with simple solutions (Reinhard et al., 2001). The color transfer algorithm thus converts example and target images to LMS cone space. Then the logarithm is taken, followed by a rotation to color opponent space. Here, the mean and variance of all pixels in both images are computed. This results in three means and three variances for each of the images. The target image is then shifted and scaled so that in color opponent space the data along each of the three axes has the same mean and variance as the example image. The result is then converted back to RGB space.
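In symbols, this shift-and-scale amounts to the following mapping for the luminance channel, where the overbar denotes a channel mean and σ a channel standard deviation (the α and β channels are treated identically):

\[
L_{target} \leftarrow \frac{\sigma_{L,example}}{\sigma_{L,target}} \left( L_{target} - \overline{L}_{target} \right) + \overline{L}_{example}
\]

The equations used in our method, given in the next section, are a relaxed form of this mapping.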

ALGORITHM

For real-time operation, we would like to apply the color transfer algorithm on a frame-by-frame basis. For this approach to be successful, it is paramount that the number of computations executed for each frame be minimized.

To reduce the number of computations, we would like to know whether executing the color transfer algorithm in logarithmic space is absolutely necessary. Our results show that it is not. We can therefore omit taking the logarithm for each pixel.

Since the conversion from RGB color space to LMS cone space is achieved by multiplying each RGB triplet by a 3x3 matrix, and the conversion between LMS and Lαβ color opponent space is also achieved by matrix multiplication, these two color space conversions may be merged into a single matrix multiplication which needs to be executed for each pixel in the source and target frames. This optimization saves 3 logarithms and a 3x3 matrix multiplication for each pixel in both images.

By omitting the logarithm, we effectively no longer use Lαβ space, but convert to an approximation of the CIE Lab color space instead (Wyszecki and Stiles, 1982):

\[
\begin{bmatrix} L \\ \alpha \\ \beta \end{bmatrix}
=
\begin{bmatrix}
 0.3475 &  0.8231 &  0.5559 \\
 0.2162 &  0.4316 & -0.6411 \\
 0.1304 & -0.1033 & -0.0269
\end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
\tag{1}
\]

The inverse transform to go from Lab to RGB is given here for convenience:

\[
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
=
\begin{bmatrix}
 0.5773 &  0.2621 &  5.6947 \\
 0.5774 &  0.6072 & -2.5444 \\
 0.5832 & -1.0627 &  0.2073
\end{bmatrix}
\begin{bmatrix} L \\ \alpha \\ \beta \end{bmatrix}
\tag{2}
\]
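For illustration, the two conversions amount to a pair of 3x3 matrix multiplications per pixel. The following C++ sketch shows the idea; the type and function names are illustrative and not taken from our implementation:

    // Sketch of the color space conversions of Equations (1) and (2).
    struct Color3 { float x, y, z; };   // holds either R,G,B or L,alpha,beta

    static const float RGB2LAB[3][3] = {
        { 0.3475f,  0.8231f,  0.5559f },
        { 0.2162f,  0.4316f, -0.6411f },
        { 0.1304f, -0.1033f, -0.0269f }
    };

    static const float LAB2RGB[3][3] = {
        { 0.5773f,  0.2621f,  5.6947f },
        { 0.5774f,  0.6072f, -2.5444f },
        { 0.5832f, -1.0627f,  0.2073f }
    };

    inline Color3 multiply(const float m[3][3], const Color3& c)
    {
        return Color3 {
            m[0][0]*c.x + m[0][1]*c.y + m[0][2]*c.z,
            m[1][0]*c.x + m[1][1]*c.y + m[1][2]*c.z,
            m[2][0]*c.x + m[2][1]*c.y + m[2][2]*c.z
        };
    }

    inline Color3 rgbToLab(const Color3& rgb) { return multiply(RGB2LAB, rgb); }
    inline Color3 labToRgb(const Color3& lab) { return multiply(LAB2RGB, lab); }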

Within Lαβ space we compute the mean and standard deviations in each of the L, α and β channels for both the rendered image and the captured frame. A further opportunity to reduce the number of computations may be exploited by computing these statistics on only a subset of all pixels. We may for instance choose to compute these statistics on only every tenth pixel, thus reducing the number of computations ten-fold for this step.

For the rendered data we also omit any pixels which are transparent, since we will substitute video data for these pixels during the compositing step.
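A sketch of this subsampled statistics computation is given below (C++, with an illustrative buffer layout and names; the actual implementation differs in its low-level details):

    // Sketch: per-frame mean and standard deviation per channel, computed on
    // every tenth pixel. For the rendered frame, transparent pixels are skipped.
    #include <cmath>
    #include <cstddef>

    struct Stats { float mean[3]; float stddev[3]; };

    Stats computeStats(const float* lab,             // interleaved L, alpha, beta values
                       const unsigned char* alpha,   // 0 = transparent; null for video frames
                       std::size_t pixelCount,
                       std::size_t stride = 10)      // sample every tenth pixel
    {
        double sum[3] = {0, 0, 0}, sumSq[3] = {0, 0, 0};
        std::size_t n = 0;
        for (std::size_t i = 0; i < pixelCount; i += stride) {
            if (alpha && alpha[i] == 0) continue;    // skip transparent rendered pixels
            for (int c = 0; c < 3; ++c) {
                double v = lab[3*i + c];
                sum[c]   += v;
                sumSq[c] += v * v;
            }
            ++n;
        }
        Stats s = {};
        if (n == 0) return s;
        for (int c = 0; c < 3; ++c) {
            double mean = sum[c] / n;
            s.mean[c]   = float(mean);
            s.stddev[c] = float(std::sqrt(sumSq[c] / n - mean * mean));
        }
        return s;
    }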

As in the original color transfer algorithm, the means and variances are used to adjust the pixels of the rendered frame. However, we have found that if we shift and scale the pixel data such that the means and variances are precisely the same as those of the video capture, the results are too dramatic. For instance, if the video depicts a red brick wall, adjusting the rendered data displayed in front of the wall to have the same means and variances causes that data to become fully brick-colored. Rather than rely on manually specified swatches to combat this problem, we introduce two user parameters to allow the adjustment to be incomplete. These parameters do not usually need to be adjusted for each frame, but are of the set-and-forget variety.

The adjustment of rendered pixels according to the means and variances of the video data, as well as the newly introduced user parameters, is then given by:

\[
L_{target} = s_L \left( L_{target} - \overline{L}_{target} \right) + \overline{L}_{target} + d_L \tag{3}
\]
\[
\alpha_{target} = s_{\alpha} \left( \alpha_{target} - \overline{\alpha}_{target} \right) + \overline{\alpha}_{target} + d_{\alpha} \tag{4}
\]
\[
\beta_{target} = s_{\beta} \left( \beta_{target} - \overline{\beta}_{target} \right) + \overline{\beta}_{target} + d_{\beta} \tag{5}
\]

In these equations, the overbar denotes a mean: \overline{L}_{target} is the mean luminance of the rendered frame. The quantities s_L and d_L are two constants that are computed once for every frame. The α and β channels are treated similarly.

The scale factors s_L, s_α and s_β adjust each pixel such that the variance of the rendered frame becomes equal to the variance of the captured video (the example frame), and include a user parameter s which allows the adjustment to be incomplete:

\[
s_L = 1 - s + s \, \frac{\sigma_{L,example}}{\sigma_{L,target}} \tag{6}
\]
\[
s_{\alpha} = 1 - s + s \, \frac{\sigma_{\alpha,example}}{\sigma_{\alpha,target}} \tag{7}
\]
\[
s_{\beta} = 1 - s + s \, \frac{\sigma_{\beta,example}}{\sigma_{\beta,target}} \tag{8}
\]

Here, the standard deviation observed in a particular channel of the example and target frames is indicated with σ. The offsets d_L, d_α and d_β steer the mean of the target image towards the mean of the example image to the extent dictated by user parameter d:

\[
d_L = d \left( \overline{L}_{example} - \overline{L}_{target} \right) \tag{9}
\]
\[
d_{\alpha} = d \left( \overline{\alpha}_{example} - \overline{\alpha}_{target} \right) \tag{10}
\]
\[
d_{\beta} = d \left( \overline{\beta}_{example} - \overline{\beta}_{target} \right) \tag{11}
\]
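The per-frame constants of Equations (6) through (11) can be computed from the two sets of statistics, as in the following sketch (reusing the illustrative Stats type from the earlier sketch):

    // Sketch of the per-frame scale and offset constants of Equations (6)-(11).
    struct FrameConstants { float scale[3]; float offset[3]; };

    FrameConstants computeConstants(const Stats& example,  // captured video frame
                                    const Stats& target,   // rendered frame
                                    float s, float d)      // user parameters
    {
        FrameConstants fc;
        for (int c = 0; c < 3; ++c) {
            fc.scale[c]  = 1.0f - s + s * example.stddev[c] / target.stddev[c];  // Eqs. (6)-(8)
            fc.offset[c] = d * (example.mean[c] - target.mean[c]);               // Eqs. (9)-(11)
        }
        return fc;
    }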

We combine this adjustment with the actual compositing step, so that a pixel is either copied from the video capture, or it is a rendered pixel which is adjusted by the above formula. Once each pixel has either been copied or adjusted, the resulting image is converted back to RGB space for display.
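A sketch of this combined pass is shown below (illustrative names and buffer layout; Color3, labToRgb, Stats and FrameConstants refer to the earlier sketches):

    // Sketch of the combined adjustment and compositing pass: each output pixel
    // is either a copied video pixel, or a rendered pixel adjusted with
    // Equations (3)-(5) and converted back to RGB with Equation (2).
    #include <cstddef>

    void adjustAndComposite(const Color3* videoRgb,      // captured frame in RGB
                            const Color3* renderLab,     // rendered frame in L-alpha-beta
                            const unsigned char* alpha,  // 0 = transparent rendered pixel
                            const Stats& renderStats,    // means of the rendered frame
                            const FrameConstants& fc,    // per-frame scale and offset
                            Color3* outRgb,
                            std::size_t pixelCount)
    {
        for (std::size_t i = 0; i < pixelCount; ++i) {
            if (alpha[i] == 0) {
                outRgb[i] = videoRgb[i];                 // show the captured video
                continue;
            }
            const Color3& p = renderLab[i];
            Color3 lab;
            lab.x = fc.scale[0] * (p.x - renderStats.mean[0]) + renderStats.mean[0] + fc.offset[0];
            lab.y = fc.scale[1] * (p.y - renderStats.mean[1]) + renderStats.mean[1] + fc.offset[1];
            lab.z = fc.scale[2] * (p.z - renderStats.mean[2]) + renderStats.mean[2] + fc.offset[2];
            outRgb[i] = labToRgb(lab);                   // Equation (2)
        }
    }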

VISUAL QUALITY

The visual quality depends on the number of pixels in each frame that are used in the computation of the means and variances, as well as on the user parameters s and d. We demonstrate the behavior of our algorithm by means of a short sequence of captured video (several frames are shown in Figure 1) and a rendered sequence of a rotating multi-colored sphereflake, as shown in Figure 2. All frames were created at a resolution of 720 by 480 pixels and then run-length encoded into two separate movies. All compositing and color correction is then applied to the two movies without further referring to the original separate frames, or indeed the 3D geometry.

Simply compositing the two streams would cause the color content of the sphereflake to stand out and not mesh very well with the background (Figure 3). Full application of our algorithm results in the frames shown in Figure 4. Here, the algorithm has been too effective. In particular, the fourth frame, which shows predominantly a brick wall, has caused the sphereflake to take on too much red. By adjusting the s user parameter to a value of 0.4 we obtain the sequence of frames shown in Figure 5. We believe that this parameter setting is a reasonable default value between too much and too little adjustment.

We have also found that partial adjustment of the user parameter d did not appreciably change the result. We therefore conclude that this parameter may be removed to save a small number of computations, leaving only s as a user parameter.

With these default settings we combined three different sequences of captured video with three different sets of rendered data. From each of the resulting nine composited sequences we captured a representative frame; the results are shown in uncorrected form in Figure 6 and in corrected form in Figure 7. We have found that for our test scenes further parameter tuning was unnecessary. We always set s to 0.4 and obtained plausible results. This does not mean that this approach will always yield acceptable results, but we have not found a combination of video and rendered content that required adjustments to these parameter settings.

PERFORMANCE

A substantial performance gain was recorded by unrolling loops and reordering memory accesses by hand. Our sequence rendered at a frame rate of around 13 fps before this low-level code optimization, and improved to around 22 fps afterwards. A simple compositing algorithm which does not apply any color correction, but uses the same procedures to display the resulting frames, runs at 30 fps under the same configuration. We therefore conclude that color correcting our test sequence incurs a performance penalty of around 28%.

This optimization includes a scheme whereby the current frame is adjusted on the basis of means and variances computed from the previous frame. In addition, the variances computed during the previous frame make use of means computed two frames back. This allows us to simplify and streamline the implementation such that only a single pass over all pixels is required for each frame.
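One way to organize this single pass is sketched below with illustrative names: running sums are accumulated while pixels are adjusted, and are folded into new statistics at the end of each frame.

    // Sketch of the frame-delayed statistics scheme. The Color3 type is the one
    // from the earlier sketch; accumulate() is called per contributing pixel
    // during the adjustment/compositing pass, finishFrame() once per frame.
    #include <cmath>
    #include <cstddef>

    struct RunningStats {
        double sum[3]    = {0, 0, 0};  // accumulated during the current frame
        double sumDev[3] = {0, 0, 0};  // squared deviations from the previous mean
        std::size_t n    = 0;
        float mean[3]    = {0, 0, 0};  // mean of the previous frame
        float stddev[3]  = {0, 0, 0};  // previous frame, around the mean from two frames back
    };

    inline void accumulate(RunningStats& rs, const Color3& lab)
    {
        const float v[3] = { lab.x, lab.y, lab.z };
        for (int c = 0; c < 3; ++c) {
            rs.sum[c] += v[c];
            double dev = v[c] - rs.mean[c];   // mean is one frame old
            rs.sumDev[c] += dev * dev;
        }
        ++rs.n;
    }

    inline void finishFrame(RunningStats& rs)
    {
        if (rs.n == 0) return;
        for (int c = 0; c < 3; ++c) {
            // standard deviation of this frame's samples around the previous mean
            rs.stddev[c] = float(std::sqrt(rs.sumDev[c] / rs.n));
            rs.mean[c]   = float(rs.sum[c] / rs.n);
            rs.sum[c] = rs.sumDev[c] = 0.0;
        }
        rs.n = 0;
    }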

The performance of our implementation depends to some extent on the number of pixels that are rendered in each frame. It is the rendered pixels that are adjusted, whereas the video pixels are only analyzed. This dependency is evidenced by plotting the frame-rate over time, as shown in Figure 8. We see that the part of the sequence around frame 30 is slower than the first and last parts. This correlates with the number of rendered pixels, which we plot in black in Figure 8. The time per frame closely follows the high frequency fluctuations present in the version where we composite without color correction.

Figure 8: Time per frame in seconds, with and without color correction, and number of rendered pixels (x10^4) as a function of frame number.

Our experiments with skipping frames for computing statistics, or omitting pixels from consideration, did not prove to be successful. For instance, the extra if-statement necessary to determine if the current frame should be used to recompute means and variances introduces a branch in the control flow which causes the floating point units of the CPU to become partially unfilled. The introduction of this single if-statement causes the frame-rate to drop by 3 fps.

In addition, the bottleneck in our computations is the adjustment of pixels, and not the computation of statistics. Unfortunately, it is not possible to skip pixel adjustment for any rendered pixels, since this would introduce flickering. For the purpose of comparison, we omitted the computation of new statistics for each frame, while still correcting rendered pixels, albeit with precomputed values. The frame-rate increased by around 1.5 fps. Compared with the 3 fps that an extra if-statement would cost, the overhead of computing new means and variances in a brute-force manner for every frame is relatively inexpensive.

CONCLUSIONS

A simple and effective method to glean the overall color scheme of one image and impose it upon another may be extended for use in real-time mixed-reality applications. Low level optimizations proved to be more effective in gaining performance than high level schemes such as computing statistics for only a subset of the frames, or computing statistics on a subset of pixels within each frame. The branching introduced by such high level optimizations causes the floating point pipelines to be used less efficiently, which has a negative impact on performance. The overall effect is that performance is better with a brute-force approach than with these optimizations.

We introduced user parameters which make the transfer of colors between the rendered stream and the video stream partial, which improves the quality of the results. For hands-off approaches such as those required in mixed-reality applications, such set-and-forget user parameters are more effective than the manual specification of swatches that the original method employed as a means of user control.

While we demonstrated our algorithm in terms of a mixed-reality scenario, it is equally applicable to the work of colorists. Scenes may be adjusted such that the chromatic content of subsequent scenes does not exhibit unpleasant jumps. Creative applications, whereby scenes are colored for a specific mood or artistic effect, are also catered for. We envisage that scenes may be recolored according to existing footage which already has the desired artistic effect.

ACKNOWLEDGMENTS

This work is partially supported by the Office of Naval Research and the Army's Research, Development & Engineering Command.

REFERENCES

Chang, Y., Saito, S., Uchikawa, K., and Nakajima, M. 2004. Example-based color stylization based on categorical perception. In First Symposium on Applied Perception in Graphics and Visualization.

Hertzmann, A., Jacobs, C. E., Oliver, N., Curless, B., and Salesin, D. H. 2001. Image analogies. In Proc. of the 28th Annual Conference on Computer Graphics and Interactive Techniques, 327–340.

Reinhard, E., Ashikhmin, M., Gooch, B., and Shirley, P. 2001. Color transfer between images. IEEE Computer Graphics and Applications 21, 34–41.

Ruderman, D. L., Cronin, T. W., and Chiao, C. 1998. Statistics of cone responses to natural images: Implications for visual coding. J. Opt. Soc. Am. A 15, 8, 2036–2045.

Welsh, T., Ashikhmin, M., and Mueller, K. 2002. Transferring color to greyscale images. In Proc. of the 29th Annual Conference on Computer Graphics and Interactive Techniques, 277–280.

Wyszecki, G., and Stiles, W. S. 1982. Color Science: Concepts and Methods, Quantitative Data and Formulae. John Wiley and Sons.


Figure 1: Four frames from a sequence of video.

Figure 2: Four frames from a computer generated sequence.

Figure 3: Compositing without color correction.

Figure 4: Full adjustment.

Figure 5: Partial adjustment with a value of s = 0.4.


Figure 6: Uncorrected frames.

Figure 7: Corrected frames.
