
G. Bebis et al. (Eds.): ISVC 2008, Part II, LNCS 5359, pp. 420–429, 2008. © Springer-Verlag Berlin Heidelberg 2008

Stitching Video from Webcams

Mai Zheng, Xiaolin Chen, and Li Guo

Department of Electronic Science and Technology, University of Science and Technology of China

{zhengmai,myhgy}@mail.ustc.edu.cn, [email protected]

Abstract. This paper presents a technique to create a wide field-of-view video from common webcams. Our system consists of two stages: the initialization stage and the real-time stage. In the first stage, we detect robust features in the initial frame of each webcam and find the corresponding points between them. The matched point pairs are then employed to compute the perspective matrix which describes the geometric relationship of the adjacent views. After the initialization stage, we register the frame sequences of the different webcams on the same plane using the perspective matrix and synthesize the overlapped region using a nonlinear blending method in real time. In this way, the narrow fields of the webcams are displayed together as one wide scene. We demonstrate the effectiveness of our method on a prototype that consists of two ordinary webcams and show that this is an interesting and inexpensive way to experience wide-angle observation.

1 Introduction

As we explore a scene, we turn our eyes and head around, capture information in a wide field-of-view, and then get a comprehensive view. Similarly, a panoramic picture or video can always provide much more information, as well as a richer experience, than a single narrow representation. These advantages, together with various application prospects such as teleconferencing and virtual reality, have motivated many researchers to develop techniques for creating panoramas.

Typically, the term “panorama” refers to a single-viewpoint panoramic image [1], which can be created by rotating a camera around its optical center. Another main type of panorama is the strip panorama [2][3], which is created from a translating camera. But no matter which technical variant is used, creating a panorama starts with static images and requires that all of the frames to be stitched be prepared and organized as an image sequence before mosaicing. For a static mosaic, there is no time constraint on stitching all of the images into one.

In this paper, we propose a novel method to create panoramic video from webcams. Unlike previous video mosaics [4], which move one camera to record a continuous image sequence and then create a static panoramic image, we capture two video streams and stitch each pair of frames from the different videos in real time. In other words, our panorama is a wide video displayed in real time rather than a static panoramic picture.


2 Related Work

A tremendous amount of progress has been made in static image mosaicing. For example, strip panorama techniques [2][3] capture the horizontal outdoor scenes continuously and then stitch them into a long panoramic picture, which can be used for digital tourism and the like. Many techniques such as plane-sweep [5] and multi-view projection [6] have been developed for removing ghosting and blurring artifacts.

As for panoramic video, however, the technology is still not mature. One of the main difficulties is the real-time requirement. The common frame rate is 25–30 FPS, so if we want to create a video panorama, we need to create each panoramic frame within at most 0.04 seconds, which means that the stitching algorithms for static image mosaicing cannot be applied directly to real-time frames. And because of the time-consuming computation involved, existing methods for improving static panoramas can hardly be applied to stitching videos. To sidestep these difficulties, some researchers resort to hardware. For example, a carefully designed camera cluster that guarantees an approximately common virtual COP (center of projection) [7] can easily register the inputs and avoid parallax to some extent. But from another perspective, this kind of approach is undesirable because it relies heavily on the capturing device.
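For concreteness, the per-frame time budget implied by these frame rates is just the reciprocal of the frame rate:

\[
\frac{1}{25\ \text{FPS}} = 0.040\ \text{s per frame}, \qquad \frac{1}{30\ \text{FPS}} \approx 0.033\ \text{s per frame},
\]

so the 0.04-second figure corresponds to the lower end of the common frame-rate range.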

Our approach does not need special hardware. Instead, it makes use of ordinary webcams, which means that the system is inexpensive and easily applicable. Besides this, the positions and directions of the webcams are flexible as long as they have some overlapped field-of-view. We design a two-stage solution to tackle this challenging situation. The whole system is discussed in Section 3, and the implementation of each stage is discussed in detail in Sections 4 and 5.

3 System Framework

As is shown in Fig. 1, the inputs of our system are independent frame sequences from two common webcams and the output is the stitched video. To achieve real-time performance, we separate the processing into two stages. The first one, called the initialization stage, only needs to be run once after the webcams are fixed. This stage includes several time-consuming procedures which are responsible for calculating the geometric relationship between the adjacent webcams. We first detect robust features in the initial frame of each webcam and then match them between the adjacent views. The correct matches are then employed to estimate the perspective matrix. The next stage runs in real time. In this stage, we use the matrix from the first stage to register the frames of the different webcams on the same plane and blend the overlapped region using a nonlinear weight mask. The implementation of the two stages is discussed later in detail.


Fig. 1. Framework of our system. The initialization stage estimates the geometric relationship between the webcams based on the initial frames. The real-time stage registers and blends the frame sequences in real time.

4 Initialization Stage

Since the location and orientation of the webcams are flexible, the geometric relationship between the adjacent views is unknown before registration. We choose one webcam as a base and use the full planar perspective motion model [8] to register the other view on the same plane. The planar perspective transform warps an image into another using 8 parameters:

\[
\mathbf{u}' = \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} \sim \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \mathbf{H}\mathbf{u}. \qquad (1)
\]

where $\mathbf{u} = (x, y, 1)^{T}$ and $\mathbf{u}' = (x', y', 1)^{T}$ are the homogeneous coordinates of the two views, and $\sim$ indicates equality up to scale since $\mathbf{H}$ is itself homogeneous. The perspective transform is a superset of the translation, rigid, and similarity as well as affine transforms. We seek to compute an optimized matrix $\mathbf{H}$ between the views so that they can be aligned well in the same plane.
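As an illustration of how such a matrix is applied in practice, the following is a minimal sketch in Python with OpenCV and NumPy (an assumption; the paper does not state its implementation language or libraries). The matrix values, function name, and canvas size are placeholders for illustration only.

```python
import cv2
import numpy as np

def warp_to_base_plane(frame_adj, H, canvas_size):
    """Project a frame from the adjacent webcam onto the base webcam's plane.

    frame_adj   : image from the adjacent (non-base) webcam
    H           : 3x3 planar perspective matrix mapping adjacent -> base coordinates
    canvas_size : (width, height) of the output canvas holding both views
    """
    # warpPerspective applies u' ~ H u to every pixel coordinate of the input frame.
    return cv2.warpPerspective(frame_adj, H, canvas_size)

if __name__ == "__main__":
    # Hypothetical H: a pure translation placing the warped view beside a QVGA base view.
    H = np.array([[1.0, 0.0, 320.0],
                  [0.0, 1.0,   0.0],
                  [0.0, 0.0,   1.0]])
    frame_adj = np.zeros((240, 320, 3), np.uint8)   # stand-in for a captured frame
    canvas = warp_to_base_plane(frame_adj, H, (640, 240))
```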

To recover the 8 parameters, we first extract keypoints in each input frame and then match them between the adjacent two. Many classic detectors such as Canny [9] and Harris [10] could be employed to extract interest points. However, they are not robust enough for matching in our case, which involves rotation and some perspective relation between the adjacent views. In this paper, we compute SIFT features [11][12], which were originally used in object recognition.


Simply put, there are 4 extraction steps. In the first step, we filter the frame with a Gaussian kernel:

\[
L(x, y, \sigma) = G(x, y, \sigma) * I(x, y). \qquad (2)
\]

where $I(x, y)$ is the initial frame and $G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}}\, e^{-(x^{2}+y^{2})/(2\sigma^{2})}$. Then we construct a DoG (Difference of Gaussians) space as follows:

\[
D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma). \qquad (3)
\]

where $k$ is the scaling factor. The extrema in the DoG space are taken as keypoints. In the second step, we calculate the accurate localization of the keypoints through a Taylor expansion of the DoG function:

\[
D(\mathbf{v}) = D + \frac{\partial D}{\partial \mathbf{v}}^{T} \mathbf{v} + \frac{1}{2}\, \mathbf{v}^{T} \frac{\partial^{2} D}{\partial \mathbf{v}^{2}}\, \mathbf{v}. \qquad (4)
\]

where $\mathbf{v} = (x, y, \sigma)^{T}$. From formula (4), we get the sub-pixel and sub-scale coordinates as follows:

\[
\hat{\mathbf{v}} = -\left( \frac{\partial^{2} D}{\partial \mathbf{v}^{2}} \right)^{-1} \frac{\partial D}{\partial \mathbf{v}}. \qquad (5)
\]

A threshold is applied to the $D(\hat{\mathbf{v}})$ value to discard unstable points. We also make use of the Hessian matrix to eliminate edge responses:

\[
\frac{\operatorname{Tr}(\mathbf{M}_{\mathrm{Hes}})^{2}}{\operatorname{Det}(\mathbf{M}_{\mathrm{Hes}})} < \frac{(r+1)^{2}}{r}. \qquad (6)
\]

where $\mathbf{M}_{\mathrm{Hes}} = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{pmatrix}$ is the Hessian matrix, and $\operatorname{Tr}(\mathbf{M}_{\mathrm{Hes}})$ and $\operatorname{Det}(\mathbf{M}_{\mathrm{Hes}})$ are its trace and determinant. $r$ is an empirical threshold, and we set $r = 10$ in this study.

In the third step, the gradient orientations and magnitudes of the sample pixels within a Gaussian window are used to build a histogram that assigns an orientation to the keypoint. Finally, a 128-dimensional descriptor of every keypoint is obtained by concatenating the orientation histograms over a 16×16 region.
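The four steps above are what a standard SIFT implementation performs internally. As a minimal sketch (assuming Python with an OpenCV build that includes SIFT, which the paper does not prescribe), detecting keypoints and computing their descriptors for an initial frame could look like this:

```python
import cv2

def detect_sift_features(frame):
    """Detect SIFT keypoints and compute their 128-D descriptors for one initial frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # SIFT_create encapsulates the steps above: DoG extrema, sub-pixel refinement,
    # edge-response pruning, orientation assignment, and 128-D descriptors.
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors

# The initial frame of each webcam is processed once, during the initialization stage:
# kp_base, des_base = detect_sift_features(frame_base)
# kp_adj,  des_adj  = detect_sift_features(frame_adj)
```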

By comparing the Euclidean distances of the descriptors, we get an initial set of corresponding keypoints (Fig. 2(a)). The feature descriptors are invariant to translation and rotation as well as scaling. However, they are only partially affine-invariant, so the initial matched pairs often contain outliers in our case. We prune the outliers


by fitting the candidate correspondences into a perspective motion model based on RANSAC [13] iteration. Specifically, we randomly choose 4 pairs of matched points in each iteration and calculate an initial projective matrix, then use the formula below to find out whether the matrix is suitable for other points:

\[
\left\| \begin{pmatrix} x'_{n} \\ y'_{n} \\ 1 \end{pmatrix} - \mathbf{H} \cdot \begin{pmatrix} x_{n} \\ y_{n} \\ 1 \end{pmatrix} \right\| < \theta. \qquad (7)
\]

Here $\mathbf{H}$ is the initial projective matrix and $\theta$ is the outlier threshold. In order to achieve a better tolerance of parallax, a loose inlier threshold is used. The matrix consistent with the most initial matched pairs is considered the best initial matrix, and the pairs fitting it are considered correct matches (Fig. 2(b)).
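A compact sketch of this matching-plus-RANSAC step, again in Python/OpenCV rather than the authors' own implementation, could look as follows. The ratio test and the threshold value are illustrative choices, since the paper only states that Euclidean descriptor distances are compared and that a loose inlier threshold is used.

```python
import cv2
import numpy as np

def match_and_estimate_homography(kp_base, des_base, kp_adj, des_adj, theta=5.0):
    """Match SIFT descriptors by Euclidean distance and estimate H with RANSAC.

    theta is the inlier threshold in pixels (an assumed value; the paper
    only says a loose threshold is used to tolerate parallax).
    """
    # Initial correspondences: nearest neighbours in descriptor space,
    # pruned with Lowe's ratio test (a common choice, not mandated by the paper).
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des_adj, des_base, k=2)
    good = [m for m, n in knn if m.distance < 0.75 * n.distance]

    # RANSAC: repeatedly pick 4 pairs, fit a projective matrix, count inliers.
    pts_adj = np.float32([kp_adj[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    pts_base = np.float32([kp_base[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inlier_mask = cv2.findHomography(pts_adj, pts_base, cv2.RANSAC, theta)
    return H, inlier_mask
```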

Fig. 2. (a) Two frames with large misregistration and the initial matched features between them. Note that there are mismatched pairs besides the correct ones. (b) Correct matches after RANSAC filtering.

After purifying the matched pairs, the ideal perspective matrix $\mathbf{H}$ is estimated using a least-squares method. In detail, we construct the error function below and minimize the sum of the squared distances between the coordinates of the corresponding features:

\[
F_{\mathrm{error}} = \sum_{n=1}^{N} \left\| \mathbf{H}\,\mathbf{u}_{\mathrm{warp},n} - \mathbf{u}_{\mathrm{base},n} \right\|^{2} = \sum_{n=1}^{N} \left\| \mathbf{u}'_{\mathrm{warp},n} - \mathbf{u}_{\mathrm{base},n} \right\|^{2} \qquad (8)
\]

where $\mathbf{u}_{\mathrm{base},n}$ is the homogeneous coordinate of the $n$-th feature in the image being projected onto, and $\mathbf{u}_{\mathrm{warp},n}$ is its correspondence in the other view.
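For clarity, the error of formula (8) can be evaluated directly from the inlier correspondences. The short NumPy sketch below is an illustrative helper, not part of the paper; it normalizes the projected homogeneous coordinates before taking the distance, which is the usual convention since $\mathbf{H}$ is defined only up to scale.

```python
import numpy as np

def reprojection_error(H, pts_warp, pts_base):
    """Sum of squared distances between H-projected points and their base-view matches.

    pts_warp, pts_base : (N, 2) arrays of corresponding pixel coordinates.
    """
    n = len(pts_warp)
    u_warp = np.hstack([pts_warp, np.ones((n, 1))])   # homogeneous coordinates
    proj = (H @ u_warp.T).T                           # u' = H u
    proj = proj[:, :2] / proj[:, 2:3]                 # normalize the homogeneous scale
    return np.sum((proj - pts_base) ** 2)             # F_error of formula (8)
```

Minimizing this quantity over the eight parameters of $\mathbf{H}$, for example with a standard least-squares solver, yields the refined matrix used in the real-time stage.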


5 Real-Time Stage

After obtaining the perspective matrix between the adjacent webcams, we project the frames of one webcam onto the other and blend them in real time. Since the webcams are placed relatively freely, they may not share a common center of projection and are therefore likely to produce parallax. In other words, the frames of the different webcams cannot be registered exactly. We therefore design a nonlinear blending strategy to minimize the ghosting and blurring in the overlapped region. Essentially, this is a kind of alpha-blending. The synthesized frames $F_{\mathrm{syn}}$ can be expressed as follows:

\[
F_{\mathrm{syn}}(x, y) = \alpha(x, y) \cdot F_{\mathrm{base}}(x, y) + \bigl(1 - \alpha(x, y)\bigr) \cdot F_{\mathrm{proj}}(x, y) \qquad (9)
\]

where $F_{\mathrm{base}}$ are the frames of the base webcam, $F_{\mathrm{proj}}$ are the frames projected from the adjacent webcam, and $\alpha(x, y)$ is the weight at pixel $(x, y)$.
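A direct sketch of formula (9) in NumPy (illustrative; the variable names are ours, not the paper's) is:

```python
import numpy as np

def alpha_blend(frame_base, frame_proj, alpha):
    """Per-pixel alpha blending of the base frame and the projected frame, as in (9).

    frame_base, frame_proj : (H, W, 3) uint8 images registered on the same canvas
    alpha                  : (H, W) float weights in [0, 1]
    """
    a = alpha[..., np.newaxis]                        # broadcast over color channels
    blended = a * frame_base.astype(np.float32) + (1.0 - a) * frame_proj.astype(np.float32)
    return np.clip(blended, 0, 255).astype(np.uint8)
```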

In the conventional blending method, the weight of a pixel is a linear function of its distance to the image boundaries. This method treats the different views equally and performs well in normal cases. However, in the case of severe parallax, the linear combination results in blurring and ghosting across the whole overlapped region, as in Fig. 3(b). We therefore use a special $\alpha$ function that gives priority to one view, to avoid the conflict in the region where the two webcams overlap. Simply put, we construct a nonlinear $\alpha$ mask as below:

\[
\alpha(x, y) =
\begin{cases}
1, & \text{if } \min\bigl(x,\, y,\, |x - W|,\, |y - H|\bigr) > T \\[4pt]
\frac{1}{2}\Bigl[\sin\Bigl(\pi\bigl(\min\bigl(x,\, y,\, |x - W|,\, |y - H|\bigr)/T - 0.5\bigr)\Bigr) + 1\Bigr], & \text{otherwise}
\end{cases}
\qquad (10)
\]

where $W$ and $H$ are the width and height of the frame, and $T$ is the width of the nonlinearly decreasing border. The mask is registered with the frames and clipped according to the region to be blended. The $\alpha$ value remains constant in the central part of the base frame and begins to drop sharply as it approaches the boundary of the other layer. The gradual change is controlled by $T$: the transition between frames is smoother and more natural if $T$ is larger, but the clear central region is then smaller, and vice versa. We refer to this method as nonlinear mask blending. Through this nonlinear synthesis, we keep a balance between a smooth transition at the boundaries and the uniqueness and clarity of the interior.
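The mask of formula (10) can be precomputed once per frame size. The sketch below is our illustrative reading of (10), with $\min(x, y, |x-W|, |y-H|)$ interpreted as the distance to the nearest frame border; the value of $T$ in the example is an assumption.

```python
import numpy as np

def nonlinear_alpha_mask(width, height, T):
    """Build the nonlinear alpha mask of formula (10) for a width x height frame.

    T is the width (in pixels) of the nonlinearly decreasing border.
    """
    xs = np.arange(width, dtype=np.float32)
    ys = np.arange(height, dtype=np.float32)
    xx, yy = np.meshgrid(xs, ys)                                       # pixel coordinates
    d = np.minimum.reduce([xx, yy, width - 1 - xx, height - 1 - yy])   # distance to nearest border
    # Interior: full weight. Border band of width T: smooth sinusoidal falloff from 1 to 0.
    falloff = 0.5 * (np.sin(np.pi * (d / T - 0.5)) + 1.0)
    return np.where(d > T, 1.0, falloff).astype(np.float32)

# Example: a QVGA-sized mask with a 40-pixel transition band (T chosen for illustration).
# alpha = nonlinear_alpha_mask(320, 240, T=40)
```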

Fig. 3. Comparison between linear blending and our blending strategy on typical scenes with severe parallax. (a) A typical pair of scenes with strong parallax. (b) Linear blending. (c) Our blending.


6 Results

In this section, we show the results of our method on different scenes. We built a prototype with two common webcams, as shown in Fig. 4. The webcams are placed together on a simple support, and the lenses can be rotated and directed to different orientations freely. Each webcam has a resolution of QVGA (320×240 pixels) and a frame rate of 30 FPS.

Fig. 4. Two common webcams fixed on a simple support. The lenses are flexible and can be rotated and adjusted to different orientations freely.

Table 1. Processing time of the main procedures of the system

Stage            Procedure                  Time (seconds)
Initialization   Feature detection          0.320 ~ 0.450
                 Feature matching           0.040 ~ 0.050
                 RANSAC filtering           0.000 ~ 0.015
                 Matrix computation         0.000 ~ 0.001
Real time        Projection and blending    0.000 ~ 0.020

The processing time of the main procedures is listed in Table 1. The system runs on a PC with an E4500 2.2 GHz CPU and 2 GB of memory. The initialization stage usually takes about 0.7~1 second, depending on the content of the scene. Projection and blending usually take less than 0.02 seconds per pair of frames, and thus can run in real time. Note that whenever the webcams are moved, the initialization stage must be run again to re-compute the geometric relationship between the webcams. Currently, this re-initialization is started by the user. After initialization, the system can process the video at a rate of 30 FPS.
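Putting the two stages together, the real-time stage reduces to grabbing a frame from each webcam, warping one onto the base plane with the precomputed matrix, and blending with the precomputed mask. The sketch below is our illustrative assembly of these pieces in Python with OpenCV and NumPy; the device indices, canvas size, and mask registration are assumptions rather than details given in the paper.

```python
import cv2
import numpy as np

def run_realtime_stage(H, alpha, canvas_size=(640, 240)):
    """Capture, warp, blend, and display frames in a loop.

    H     : 3x3 perspective matrix (adjacent view -> base plane) from the initialization stage
    alpha : float mask of shape (canvas height, canvas width), assumed already registered
            with the canvas (1 inside the base view, falling to 0 where only the
            projected view contributes)
    """
    cap_base = cv2.VideoCapture(0)   # base webcam (device indices are assumptions)
    cap_adj = cv2.VideoCapture(1)    # adjacent webcam
    while True:
        ok_b, frame_base = cap_base.read()
        ok_a, frame_adj = cap_adj.read()
        if not (ok_b and ok_a):
            break
        # Place the base frame on the canvas and project the adjacent frame onto it.
        canvas_base = np.zeros((canvas_size[1], canvas_size[0], 3), np.float32)
        canvas_base[:frame_base.shape[0], :frame_base.shape[1]] = frame_base
        canvas_proj = cv2.warpPerspective(frame_adj, H, canvas_size).astype(np.float32)
        # Nonlinear mask blending, formula (9).
        a = alpha[..., np.newaxis]
        stitched = np.clip(a * canvas_base + (1.0 - a) * canvas_proj, 0, 255).astype(np.uint8)
        cv2.imshow("wide view", stitched)
        if cv2.waitKey(1) & 0xFF == 27:   # Esc quits
            break
    cap_base.release()
    cap_adj.release()
    cv2.destroyAllWindows()
```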

In our system, the positions and directions of the webcams are adjustable as long as they have some overlapped field-of-view. Typically, the overlapped region should be at least 20% of the original view; otherwise there may not be enough robust features to match between the webcams. Fig. 5 shows the stitching results for some typical frames. In these cases, the webcams are intentionally rotated to a certain angle or even turned upside down. As can be seen in the figures, the system can still register and blend the frames into a natural whole scene. Fig. 6 shows some typical stitched scenes from a real-time video. In (a), two static indoor views are stitched into a wide view. In (b) and (c), moving objects appear in the scene, either far away from or close to the lens. As illustrated in the figures, the stitched views are as clear and natural as the original narrow views.


(a) A pair of frames with 15° rotation and the stitching result

(b) A pair of frames with 90° rotation and the stitching result

(c) A pair of frames with 180° rotation and the stitching result

Fig. 5. Stitching frames of some typical scenes. The webcams are intentionally rotated to a certain angle or turned upside down.

(a) A static scene

(b) A far away object moving in the scene

(c) A close object moving in the scene

Fig. 6. Stitching results from a real-time video. Moving objects in the stitched scene are as clear as in the original narrow view.


Although our system is flexible and robust under normal conditions, the quality of the mosaiced video does drop severely in two cases: first, when the scene lacks salient features, as in the case of a white wall, the geometric relationship of the webcams cannot be estimated correctly; second, when the parallax is too strong, there may be noticeable stitching traces at the frame border. These problems can be avoided by aiming the lenses at scenes with salient features and adjusting the orientation of the webcams.

7 Conclusions and Future Work

In this paper, we have presented a technique for stitching videos from webcams. The system receives frame sequences from common webcams and outputs a synthesized video with a wide field-of-view in real time. The positions and directions of the webcams are flexible as long as they have some overlapped field-of-view. There are two stages in the system. The initialization stage calculates the geometric relationship between frames from adjacent webcams. A nonlinear mask blending method, which avoids ghosting and blurring in the main part of the overlapped region, is proposed for synthesizing the frames in real time. As illustrated by the experimental results, this is an effective and inexpensive way to construct video with a wide field-of-view.

Currently, we have focused on using only two webcams. As a natural extension of our work, we would like to scale up to more webcams. We also plan to explore the hard and interesting issues of eliminating exposure differences between webcams in real time and solving the problems mentioned at the end of the last section.

Acknowledgment

The financial support provided by the National Natural Science Foundation of China (Project ID: 60772032) and Microsoft (China) Co., Ltd. is gratefully acknowledged.

References

1. Szeliski, R., Shum, H.Y.: Creating Full View Panoramic Mosaics and Environment Maps. In: Proc. of SIGGRAPH 1997, Computer Graphics Proceedings, Annual Conference Series, pp. 251–258 (1997)

2. Agarwala, A., Agrawala, M., Chen, M., Salesin, D., Szeliski, R.: Photographing Long Scenes with Multi-Viewpoint Panoramas. In: Proc. of SIGGRAPH, pp. 853–861 (2006)

3. Zheng, J.Y.: Digital Route Panoramas. IEEE MultiMedia 10(3), 57–67 (2003)

4. Hsu, C.-T., Cheng, T.-H., Beukers, R.A., Horng, J.-K.: Feature-based Video Mosaic. Image Processing, 887–890 (2000)

5. Kang, S.B., Szeliski, R., Uyttendaele, M.: Seamless Stitching Using Multi-Perspective Plane Sweep. Microsoft Research, Tech. Rep. MSR-TR-2004-48 (2004)

6. Zelnik-Manor, L., Peters, G., Perona, P.: Squaring the Circle in Panoramas. In: Proc. 10th IEEE Conf. on Computer Vision (ICCV 2005), pp. 1292–1299 (2005)


7. Majumder, A., Gopi, M., Seales, W.B., Fuchs, H.: Immersive Teleconferencing: A New Algorithm to Generate Seamless Panoramic Imagery. In: Proc. of ACM Multimedia, pp. 169–178 (1999)

8. Szeliski, R.: Video Mosaics for Virtual Environments. IEEE Computer Graphics and Applications, 22–23 (1996)

9. Canny, J.: A Computational Approach to Edge Detection. IEEE Trans. Pattern Analysis and Machine Intelligence 8, 679–698 (1986)

10. Harris, C., Stephens, M.: A Combined Corner and Edge Detector. In: Proc. of the 4th Alvey Vision Conference, pp. 147–151 (1988)

11. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)

12. Winder, S., Brown, M.: Learning Local Image Descriptors. In: Proc. of the International Conference on Computer Vision and Pattern Recognition (CVPR 2007), pp. 1–8 (2007)

13. Forsyth, D.A., Ponce, J.: Computer Vision: A Modern Approach. Prentice-Hall, Englewood Cliffs (2003)