
Compressed video processing for cut detection

N.V. Patel and I.K. Sethi

Indexing terms: Chi-square test, Compressed video, Cut detection, Hypothesis testing, K-S test, MPEG, Video databases, Video segmentation, Yakimovsky test

Abstract: One of the challenging problems in video databases is the organisation of video information. Segmenting a video into a number of clips and characterising each clip has been suggested as one mechanism for organising video information. This approach requires a suitable method to automatically locate cut points (boundaries between consecutive camera shots in a video). Several existing techniques solve this problem using uncompressed video. Since video is increasingly being captured, moved, and stored in compressed form, there is a need for detecting shot boundaries directly in compressed video. The authors address this issue and show certain feature extraction steps in MPEG compressed video that allow most of the existing cut detection methods developed for uncompressed video to be implemented for MPEG video streams. They also examine the performance of three tests for cut detection by viewing the problem as a statistical hypothesis testing problem. As the experimental results indicate, the statistical hypothesis testing approach permits fast and accurate detection of video cuts.

1 Introduction

A shot or take in video parlance refers to a contiguous recording of one or more video frames depicting a continuous action in time and space [1]. During a shot, the camera may remain fixed, or it may exhibit one of the characteristic motions: panning, zooming, tilting, or tracking. In recent years, methods for the automatic determination of shot boundaries, known as cut points, to isolate shots in an edited video have received considerable attention due to many practical applications. For example, in video databases the isolation of shots is of interest because the shot level organisation of video documents is considered most appropriate for video browsing and content based retrieval [2-4]. Shots also provide a convenient level for the study of the styles of different filmmakers. A statistical characterisation of a video is also possible in terms of

© IEE, 1996
IEE Proceedings online no. 19960778
Paper first received 25th September 1995 and in revised form 1st July 1996
The authors are with the Vision and Neural Networks Laboratory, Department of Computer Science, Wayne State University, Detroit, MI 48202, USA

different attributes of a shot, for example, shot length, shot type (close shot, medium shot, and long shot), and camera movement in the shot. Such a characterisation is useful to differentiate between the styles of different moviemakers [5]. A characterisation of this nature can also provide a way of clustering video documents at a global level. Shot isolation is also needed in the colouration of black and white movies, where each shot has its own grey-to-colour mapping table.

The isolation of shots in a video is relatively easy when shot transitions consist of visually abrupt camera breaks. Such transitions, known as straight cuts, are relatively easy to detect by examining frame-to-frame intensity differences at the pixel level. In many cases, a transition between two shots is made in a gradual manner to obtain a visually pleasing transition. Such gradual transitions are introduced into a video through special editing equipment and span several frames. These transitions are generally called optical cuts. There are many different kinds of optical cuts, such as fade in, fade out, dissolve or wipe, each with its own characteristic transition signature. Such cuts are difficult to isolate through frame-to-frame intensity difference approaches and require some kind of global level characterisation of each frame for comparison purposes.

The majority of existing techniques for cut detection or video segmentation process uncompressed video. Since video is increasingly being captured, moved, and stored in compressed form, there is a need for detecting shot boundaries directly in the compressed video. Such an approach offers two advantages. First, the data rate is lower. Secondly, it avoids the decompression-recompression cycle that becomes necessary if the video is to remain in compressed form. In this paper, we adopt this view and show certain feature extraction steps in MPEG compressed video that allow most of the existing cut detection methods developed for uncompressed video to be implemented for MPEG video streams. We also examine the performance of three tests by viewing the problem of cut detection as a statistical hypothesis testing problem. As our experimental results indicate, the statistical approach permits fast and accurate detection of scene changes induced through straight as well as optical cuts.

2 Previous work

Work related to cut detection can be traced back to the pioneering work of Seyler [6], who first systematically studied the nature of frame difference signals and showed that the gamma distribution provides a good fit to the probability density function of frame difference signals. Following Seyler's work, Coll and



Choma [7] performed an extensive experimental study of frame difference signals using four different types of videos: a sports video with extensive camera movement, an action video using a variety of camera techniques, a talk-show video with a fixed background and little camera motion, and an animated cartoon video with small figures moving against a fixed background. The study was carried out by dividing each image frame into 8 × 8 blocks. Each block was then represented by its average grey scale intensity value. The average of the magnitudes of the differences between corresponding blocks of consecutive frames was used as a frame difference measure. Coll and Choma showed that this measure was able to locate shot boundaries with high accuracy. They also collected statistics on the time interval between successive shot transitions and showed that it followed a Poisson distribution whose parameters depend upon the nature of the underlying video.

In the area of computer vision, the problem of scene change detection in image sequences has been studied by Sethi, Salari, and Vemuri [8]. Their study was motivated by the desire to establish correspondence of objects across discontinuities in the camera motion. They suggested a three-frame approach to locating scene changes. Letting $S_{ij}$ and $S_{jk}$ be measures of similarity between the consecutive frame pairs $F_i$-$F_j$ and $F_j$-$F_k$, respectively, their scene change measure, called the observer motion coherence (OMC), is defined so that it returns a value close to one if there is no scene change in the three frames under consideration, and a value close to zero otherwise. To measure the similarity between a pair of frames, Sethi, Salari, and Vemuri suggested using either the normalised difference energy (NDE) measure or the absolute difference to sum ratio (ADSR).

The recent methods for cut detection can be characterised as either frame difference methods or histogram based methods. An example of a frame difference method is that of Tonomura and Otsuji [9], which uses the number of pixels undergoing change from one frame to the next to find cut points. This is achieved by taking the difference of subsampled images and thresholding it at a suitable value. Let $S_{x,y}(t)$ be a subsampled image at time $t$. The interframe difference coefficient $ID_{sum}$ can be calculated as:

$$ID_{sum}(t) = \sum_{x,y \in S} ID_{x,y}(t), \qquad ID_{x,y}(t) = \begin{cases} 1 & \text{if } \|S_{x,y}(t) - S_{x,y}(t-1)\| \geq T_d \\ 0 & \text{otherwise} \end{cases}$$

The actual decision for a cut point is taken by evaluating $ID_{sum}$ over three frames as:

$$\text{cut} = \begin{cases} \text{True} & \text{if } \|ID_{sum}(K) - ID_{sum}(K-1)\| \geq T_a \text{ and } \|ID_{sum}(K) - ID_{sum}(K+1)\| \leq T_b \\ \text{False} & \text{otherwise} \end{cases}$$

where $T_a$ and $T_b$ are suitable thresholds. The pure interframe difference approach cannot cope with noise, minor illumination changes, and camera motion. To improve performance, several modifications to the basic scheme of frame by frame comparison have been proposed. These include a five-frame comparison scheme [9] and a twin-comparison approach [10] which performs a simple motion analysis to check whether an actual transition has occurred or not. Since the input video is in uncompressed form, performing motion analysis requires extensive computation to obtain motion vectors.
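As an illustration (not part of the original papers), the pixel-change counting scheme above can be sketched in Python as follows; frame access as NumPy greyscale arrays and the threshold values are our assumptions:

```python
import numpy as np

def id_sum(prev_frame, curr_frame, t_d=20):
    """Number of subsampled pixels whose intensity change exceeds T_d."""
    diff = np.abs(curr_frame.astype(int) - prev_frame.astype(int))
    return int(np.count_nonzero(diff >= t_d))

def is_cut(frames, k, t_d=20, t_a=2000, t_b=500):
    """Three-frame decision at frame k: a large jump in the change count
    from frame k-1 to k that does not persist from k to k+1."""
    d = lambda j: id_sum(frames[j - 1], frames[j], t_d)
    return abs(d(k) - d(k - 1)) >= t_a and abs(d(k) - d(k + 1)) <= t_b
```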

Noting that the interframe difference tends to perform poorly under large close-object motion, illumination changes and camera motion, the use of intensity histogram comparisons has been suggested [11]. In this method the cut points are located using the absolute sum of the intensity histogram differences:

$$\text{cut} = \begin{cases} \text{True} & \text{if } \sum_{R} \|H_R(t + \Delta t) - H_R(t)\| \geq T_h \\ \text{False} & \text{otherwise} \end{cases}$$

where $H_R(t)$ and $H_R(t + \Delta t)$ are the histograms of the frames at time $t$ and $t + \Delta t$, respectively, and the sum runs over the histogram bins. This method assumes that the brightness distribution is related to the image content and changes only if the image changes. The use of a histogram suppresses small object motions, giving rise to fewer false cuts. The use of the colour histogram has also been explored by Nagasaka and Tanaka [12]. They used the colour correlation between two frames in a limited colour space to identify cut points, detecting a cut by applying a suitable threshold to the block correlation.

The correlation is established by comparing the colour histograms of corresponding blocks, where $m$ represents the block index, $K$ is the number of colours used, and $H_{m,i}$ is the histogram for colour $i$ of block $m$. Nagasaka's method then sorts all the correlation values and uses the eight lowest values to obtain a summed correlation quantity $C_{sum}$. A cut point is detected if $C_{sum}$ exceeds a certain threshold. Ueda [13] uses the same method, applying the threshold to 48 blocks and counting the number of blocks over the threshold. The method due to Abe, Tonomura and Kasahara [14] uses the likelihood ratio test [15] to compare two corresponding blocks in consecutive frames. In this test, the likelihood that two blocks come from different scenes is computed by the following expression:

$$\lambda = \frac{\left[\dfrac{\sigma_i^2 + \sigma_j^2}{2} + \left(\dfrac{m_i - m_j}{2}\right)^2\right]^2}{\sigma_i^2\, \sigma_j^2}$$

where $m_i$ and $m_j$ denote the mean intensity values of the blocks and $\sigma_i^2$ and $\sigma_j^2$ represent the respective variances. This method gives better performance because it takes into account information at a block level rather than at a pixel level to make a decision.
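A small sketch of this block likelihood ratio under the Gaussian assumption; the function name and the guard against flat blocks are ours:

```python
import numpy as np

def likelihood_ratio(block1, block2, eps=1e-12):
    """Likelihood that two corresponding blocks come from different
    scenes; large values indicate dissimilar blocks."""
    m1, m2 = block1.mean(), block2.mean()
    v1, v2 = block1.var(), block2.var()
    num = ((v1 + v2) / 2.0 + ((m1 - m2) / 2.0) ** 2) ** 2
    return num / max(v1 * v2, eps)  # guard against perfectly flat blocks
```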

Gargi and Devadiga [16, 17] studied which colour space is best suited to video segmentation. They used a weighted summation of the histogram difference and the histogram intersection over a specified neighbourhood in the RGB, HSV, YIQ, L*a*b*, L*u*v* and Munsell colour spaces. Their computation is based on relations of the form:

$$HD(i) = H_1(i) - H_2(i)$$

where $H_1$ and $H_2$ are the colour histograms of two consecutive video frames, $HD$ is the histogram difference, $HI$ is the inverted histogram intersection, and $T_c$ is the cut coefficient formed by combining $HD$ and $HI$ over the neighbourhood. A suitable threshold on $T_c$ determines the cut points.
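The exact weighting used in [16, 17] is not recoverable from the text above, so the sketch below only illustrates the two ingredients (bin-wise difference and inverted intersection) on one channel, with a hypothetical combination:

```python
import numpy as np

def hist_difference(h1, h2):
    """Bin-wise histogram difference HD(i) = H1(i) - H2(i)."""
    return h1 - h2

def inverted_intersection(h1, h2):
    """1 - normalised histogram intersection: 0 for identical histograms."""
    return 1.0 - np.sum(np.minimum(h1, h2)) / np.sum(h1)

def cut_coefficient(h1, h2, w=0.5):
    """Illustrative weighted combination of the two measures (the
    weighting in [16, 17] may differ)."""
    return (w * np.sum(np.abs(hist_difference(h1, h2))) / np.sum(h1)
            + (1 - w) * inverted_intersection(h1, h2))
```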

In addition to the above methods, model based schemes have recently been suggested to obtain improved cut detection. The work in this category includes the methods due to Aigrain and Joly [18], Swanberg, Shu, and Jain [19], and Zhang et al. [20]. The method due to Aigrain and Joly is based on modelling the distribution of pixel differences. Using this model, they estimate the nature of the difference histogram for different kinds of cuts. These estimates are then used to locate cuts in a given video. Swanberg, Shu, and Jain follow a different modelling approach: a model of the knowledge pertaining to the video content is created and used for cut detection. Zhang et al. have used this approach to parse news videos.

Another method worth pointing out is due to Shahraray [21], who views the problem of cut detection as a nonuniform, content based temporal filtering problem. This method uses a block matching technique to compute block motion vectors and normalised block matching values over successive neighbouring frames. A nonlinear digital order statistic filter is used to combine the block match values to generate a global match value for detecting scene transitions. To detect gradual transitions, motion controlled temporal filtering is performed on the sequence of global image match values. This method also uses the motion vector information to identify within-shot boundaries due to camera motion.

While all the methods discussed above process uncompressed video for cut detection, a few methods using compressed video have been reported recently. An early example of cut detection in compressed video is the method due to Arman, Hsu, and Chiu [22]. In this method, a subset of blocks from each JPEG encoded frame is used for scene change detection. From each block, a subset of DCT coefficients is selected. The combined set of coefficients from the selected blocks then forms a vector that characterises the underlying frame. Cut detection is done by computing the inner product of the two frame vectors representing consecutive frames. The drawback of this method is that its success depends critically on the choice of blocks, for which no clear-cut guidelines are available. Zhang et al. [23] have extended the method of Arman, Hsu, and Chiu to MPEG encoded video. Deardorff and Little [24] demonstrated the use of M-JPEG compressed images for cut detection. Yeo and Liu [25] perform cut detection using lower resolution images formed with the DC values of DCT blocks.

Some of the recent work also shows the usefulness of motion vectors and of the different types of macroblocks for scene change detection. Juan and Chang [26] use the DCT coefficients from I frames and the motion vector distributions from P and B frames for scene change detection. Liu and Zick [27] demonstrated the use of difference metrics and motion vectors for this


purpose. They tried to predict the relationship between the current frame and the reference frame using the macroblock types in P and B frames. Cherfaoui and Bertin [28] tried to predict the camera motion in order to perform video segmentation.

The direct processing of compressed video has also been demonstrated in the area of video manipulation by several researchers. For example, Chang, Chen and Messerschmitt [29, 30] have derived efficient algorithms to perform overlapping, translation, scaling, pixel manipulation, rotation and linear filtering in the DCT domain. They also extended the algorithms to work with motion compensated encoded video by first converting the encoded video to the DCT domain. Shen and Sethi [31] have suggested a block transform approach involving sign changes to perform arbitrary geometric transformations in the DCT domain.

3 Feature computation in MPEG stream

The MPEG group was formed in 1988 to generate standards for digital video and audio compression. The standard defines a compressed bit stream, which implicitly defines a decompressor. Like most compression standards, MPEG exploits the redundant information in the data stream to be compressed. Since audio compression is outside the scope of this work, we concentrate only on the MPEG digital video compression standard. Continuous motion video contains spatial as well as temporal redundancy. The spatial redundancy among neighbouring pixels in a single frame is exploited in the MPEG standard by 8 × 8 DCT block coding followed by Huffman encoding. The temporal redundancy arises because the video is composed of clips that are continuous in nature, so consecutive frames are highly similar. The notions of block matching and motion vector estimation are used to avoid coding temporally repeated blocks. The discussion in this Section is limited to a brief overview of the MPEG structure and the relevant features directly available in the encoded stream for video parsing.

Fig. 1 Layered structure of MPEG encoding: a sequence is divided into groups of pictures (GOP), pictures, slices and macroblocks; each macroblock carries four luminance blocks, the decimated chrominance blocks, and the motion vector prediction

3.1 Structural hierarchy of MPEG
The MPEG video stream has a hierarchical layered structure, as shown in Fig. 1. The top level layer is the group of pictures (GOP), which consists of different types of encoded frames: one I (intra) frame, M P (predicted) frames and N B (bidirectionally predicted) frames. The GOP is the basic chunk which provides random access to the video



stream in MPEG. Within a GOP, the I frame is the first frame available for random access.

The next level of encoding is done for each individual frame. The first frame of each GOP is an I frame. It is followed by interleaved P and B frames until the end of the GOP. A better understanding of the encoding process for I, P and B frames gives greater insight into how the encoded information can be used directly for processing purposes. The details of the encoding of I frames and of P and B frames are shown in Figs. 2 and 3. Each individual colour frame is first converted to the YUV colour space, where the first component (Y) gives the luminance information and the remaining two (U and V) give the chrominance information. The chrominance component of each frame is decimated by half in the X and Y directions. It has been shown that decimating the chrominance channels does not reduce the image quality of natural scenes by a noticeable amount.

Fig. 2 Coding scheme for I, P and B frames: intrablock coding process

Fig. 3 Coding scheme for I, P and B frames: P and B frame coding process (motion estimation per macroblock; the DCT is applied to the motion compensated prediction error (MCPE), or the macroblock is intracoded if the error is too large)

The I, P and B frames are further decomposed into smaller units called macroblocks (16 × 16 pixels) and blocks (8 × 8 pixels). The horizontal stripe consisting of all aligned macroblocks is called a slice; it is used for predicting the DC component of each block in it. Each macroblock in an image is composed of four luminance blocks. All the blocks in an I type frame are encoded by the discrete cosine transform (DCT). The P and B frames are encoded at the macroblock level by estimating the motion vector of each individual macroblock.

3.2 Intraframe encoding
As discussed earlier, the spatial redundancy is exploited by applying the DCT to each 8 × 8 block in the image. Intraframe encoding is accomplished by dividing the entire image into 8 × 8 blocks. The basic scheme for I frame encoding is shown in Fig. 2. The basic equation for the DCT is:

$$D_{u,v} = \frac{1}{4} C_u C_v \sum_{x=0}^{7}\sum_{y=0}^{7} D_{x,y} \cos\left[\frac{(2x+1)u\pi}{16}\right] \cos\left[\frac{(2y+1)v\pi}{16}\right]$$

where

$$C_u, C_v = \begin{cases} 1/\sqrt{2} & \text{for } u, v = 0 \\ 1 & \text{otherwise} \end{cases}$$

and $D_{u,v}$ is a target cell of the DCT coefficient block and $D_{x,y}$ is a target cell of the raw block.
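As a concrete illustration, the transform above can be transcribed directly in NumPy (a naive sketch for clarity rather than a fast implementation; the function name is ours):

```python
import numpy as np

def dct_8x8(block):
    """2-D DCT of an 8x8 block with the (1/4) Cu Cv scaling used above."""
    n = np.arange(8)
    # basis[u, x] = cos((2x+1) u pi / 16)
    basis = np.cos((2 * n[None, :] + 1) * n[:, None] * np.pi / 16)
    c = np.ones(8)
    c[0] = 1.0 / np.sqrt(2.0)
    return 0.25 * np.outer(c, c) * (basis @ block @ basis.T)

# with this scaling the DC coefficient is 8x the block mean (Section 3.4.1)
block = np.random.rand(8, 8)
assert np.isclose(dct_8x8(block)[0, 0], 8 * block.mean())
```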

The quantisation and zigzag coding are applied to the transformed block to reduce the number of bits and organise the block for variable length coding (VLC). The quantisation converts most of the high frequency components in the block to zero while keeping the error in the encoding of the low frequency components small. The zigzagging process organises the quantised data to suit VLC, which is done by Huffman encoding using fixed tables. The DCT coefficients have a special two-dimensional Huffman table in which one code specifies a run length of zeros together with the nonzero value that ends the run.

3.3 P and B frame encoding
The P and B frames are encoded by motion prediction using previous frames (P type) or both previous and future frames (B type). This is done at the macroblock level in the luminance (Y) channel. Every macroblock in the current frame is searched for its closest match in the previous or future frame; many different search techniques have been proposed for this task. The motion compensated prediction error (MCPE) is then computed by taking the difference of the two matching macroblocks. If the MCPE exceeds a certain threshold, all the blocks in that macroblock are intracoded as described earlier; otherwise the difference block is DCT coded and transmitted along with the motion vector. Motion vectors also provide useful intermediate information for many video parsing algorithms.

3.4 Feature computation
As discussed earlier, MPEG uses the DCT for block coding. The DCT provides much useful information that one can use for image or video processing. In the following, we show how some of this information can be extracted from an encoded image. The discussion in this Section assumes feature identification in a subsampled 8 × 8 block.

3.4.1 Mean (μ_b) and global view: The first component of a transformed block is known as the DC value. Evaluating the transformation expression at location (0, 0), where $C_u = C_v = 1/\sqrt{2}$ and both cosines equal one, gives

$$D_{0,0} = \frac{1}{8}\sum_{x=0}^{7}\sum_{y=0}^{7} D_{x,y} = 8\mu_b$$

where $D_{x,y}$ is the raw pixel value in the block at location $(x, y)$, so the DC value is directly proportional to the average intensity $\mu_b$ of the block. This is very important information utilised by many existing techniques working on compressed video. An image created using only the DC values of all the blocks provides a global view of the original image, and processing such subsampled images gives a significant processing speedup. Furthermore, the built-in smoothing of the low resolution image reduces the false scene change detection rate due to large camera or object motion. It has also been shown that many important image processing operations can be performed directly on DCT coded blocks [32].
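A minimal sketch of building such a DC ("global view") image, assuming the DCT coefficient blocks of a frame are already available as a NumPy array of shape (rows, cols, 8, 8); the layout is our assumption:

```python
import numpy as np

def dc_image(dct_blocks):
    """Global view of a frame: one pixel per block, equal to the block
    mean, recovered from the DC coefficient alone (DC = 8 * mean)."""
    return dct_blocks[:, :, 0, 0] / 8.0
```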

3.4.2 Variance (σ_b²): The nature of a block can be identified by measuring its smoothness or coarseness; in other words, the activity of a block is another important piece of information. The term activity refers to the variance of the pixel values in the block. In images, a large variance in a block is induced by object edges falling in that block. Many statistical techniques use variance information for different studies. The variance of a block can be computed using the following expression:

$$\sigma_b^2 = \frac{1}{N^2}\sum_{x=0}^{N-1}\sum_{y=0}^{N-1} D_{x,y}^2 - \mu_b^2$$

This can be measured directly in the DCT domain. As energy is preserved by an orthogonal transformation, we can write, without loss of generality:

$$\sum_{x=0}^{N-1}\sum_{y=0}^{N-1} D_{x,y}^2 = \sum_{u=0}^{N-1}\sum_{v=0}^{N-1} D_{u,v}^2$$

Therefore, the variance of an encoded block can be computed directly as:

$$\sigma_b^2 = \frac{1}{N^2}\sum_{u=0}^{N-1}\sum_{v=0}^{N-1} D_{u,v}^2 - \mu_b^2$$

The information on the variance (σ_b²) and the average block value (μ_b) can further be utilised for the histogram estimation of a block and of an image, as described next.
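A sketch of this computation on a single coefficient block, assuming the orthonormal scaling of the dct_8x8 sketch above (so that Parseval's relation holds exactly and the DC term equals 8μ_b):

```python
import numpy as np

def block_variance(coeffs):
    """Pixel variance of an 8x8 block computed purely from its DCT
    coefficients: total energy minus the DC (mean) contribution."""
    total_energy = np.sum(coeffs ** 2)  # = sum of squared pixels (Parseval)
    mean = coeffs[0, 0] / 8.0           # DC = 8 * block mean
    return total_energy / 64.0 - mean ** 2
```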

3.4.3 Histogram estimation: The intensity histogram provides a global description of the appearance of an image. Many earlier scene change detection algorithms have utilised this feature for finding the cut points in a video stream. The histogram can be computed by measuring the probability distribution of the pixel values over the entire image. Let us identify the pixel values by the notation $i$, which ranges from 0 to 255. If there are a total of $N$ pixels in an image block, the normalised histogram quantity $H_{b,i}$ of the block can be computed by the following expression:

$$H_{b,i} = \frac{1}{N}\sum_{x}\sum_{y} \delta_i(D_{x,y}), \qquad \delta_i(D_{x,y}) = \begin{cases} 1 & \text{if } D_{x,y} = i \\ 0 & \text{otherwise} \end{cases}$$

Fig. 4 Lenna girl test image

Although a grey level probability density cannot truly be normal, since grey levels are non-negative and occupy only a finite range, the normal density can be a reasonable approximation to the actual probability density if the mean grey level is well within the range and the standard deviation is small relative to the size of the range [33]. The mean value of the block intensities is generally well within the grey level range. Our empirical study shows that more than 98% of block standard deviations are less than 30, which is small for a range of 0 to 255.


Fig. 5 demonstrates this fact for the test image of Fig. 4: the majority of block standard deviations fall well below 30. Therefore, it is reasonable to say that the histogram of an 8 × 8 subimage block follows the Gaussian distribution, with 99.7% of the data points lying within a ±3σ_b distance of its mean (μ_b). Letting the value of the Gaussian function at intensity $i$ be $G_{b,i}$, the relationship is expressed as:

$$G_{b,i} = \frac{1}{\sqrt{2\pi}\,\sigma_b} \exp\left[-\frac{(i - \mu_b)^2}{2\sigma_b^2}\right]$$

If the total number of points $N$ enclosed in the Gaussian bell is known, we can compute the histogram quantity for pixel value $i$ as:

$$h_{b,i} = N\, G_{b,i}$$

while the normalised histogram is given by:

$$H_{b,i} = \frac{h_{b,i}}{N} = G_{b,i}$$

where $|i - \mu_b| \leq 3\sigma_b$.

Fig. 5 Distribution of the standard deviation of 8 × 8 blocks (standard deviation plotted against block number, blocks 0 to 3500)

The block histogram information can be further utilised to compute a region histogram. The region R could be the entire image or a selected template region. It can be computed from a compressed image by the following steps:
(i) Find the number of blocks M enclosed in region R.
(ii) Identify the mean value μ_b of each block and compute the block variance σ_b².
(iii) Compute the histogram using the following expression:

$$H_{R,i} = \sum_{b=1}^{M} N\, G_{b,i}$$

where N = 64 and i ∈ [0, 255]. Following the above procedure, the estimated intensity histogram for the Lenna girl image is shown in Fig. 7. For comparison, the original histogram of the image is also shown (Fig. 6). It can be seen that the two histograms have very similar intensity distributions.
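A sketch of this per-block Gaussian spreading; `block_stats` is an assumed list of (mean, variance) pairs, one per 8 × 8 block in region R:

```python
import numpy as np

def estimate_histogram(block_stats, n=64, bins=256):
    """Approximate the region histogram by spreading each block's
    n pixels as a Gaussian N(mean, var) over the intensity axis."""
    i = np.arange(bins)
    hist = np.zeros(bins)
    for mean, var in block_stats:
        sigma = max(np.sqrt(var), 0.5)  # avoid a degenerate flat block
        g = (np.exp(-((i - mean) ** 2) / (2 * sigma ** 2))
             / (np.sqrt(2 * np.pi) * sigma))
        hist += n * g
    return hist / hist.sum()  # normalised region histogram
```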

Fig. 6 Histogram derived from the raw image of Lenna girl

Fig. 7 Histogram derived from the compressed image of Lenna girl

3.4.4 Interframe differencing: The interframe difference between two frames can be computed directly from the DCT images using the additive (linearity) property of the transformation. If $D_{x,y}(t)$ is the raw pixel value at location $(x, y)$ in the frame at time $t$ and $D_{u,v}(t)$ is the corresponding DCT coefficient at location $(u, v)$, the DCT of the frame difference is given by:

$$\begin{aligned}
\mathrm{DCT}\big(D_{x,y}(t) - D_{x,y}(t-1)\big) &= \frac{1}{4} C_u C_v \sum_{x=0}^{7}\sum_{y=0}^{7} \big(D_{x,y}(t) - D_{x,y}(t-1)\big) \cos\left[\frac{(2x+1)u\pi}{16}\right] \cos\left[\frac{(2y+1)v\pi}{16}\right] \\
&= \frac{1}{4} C_u C_v \sum_{x=0}^{7}\sum_{y=0}^{7} D_{x,y}(t) \cos\left[\frac{(2x+1)u\pi}{16}\right] \cos\left[\frac{(2y+1)v\pi}{16}\right] \\
&\quad - \frac{1}{4} C_u C_v \sum_{x=0}^{7}\sum_{y=0}^{7} D_{x,y}(t-1) \cos\left[\frac{(2x+1)u\pi}{16}\right] \cos\left[\frac{(2y+1)v\pi}{16}\right] \\
&= D_{u,v}(t) - D_{u,v}(t-1)
\end{aligned}$$

This shows that the DCT of the interframe difference is the same as the difference of the DCTs of the individual images. One can utilise this fact and compute the frame difference directly in the frequency domain; a suitable choice of threshold gives the same result as raw frame differencing.
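A quick self-contained check of this linearity property, using SciPy's orthonormal type-II DCT on hypothetical random blocks:

```python
import numpy as np
from scipy.fft import dctn  # multidimensional type-II DCT

a = np.random.rand(8, 8)
b = np.random.rand(8, 8)

# DCT of the difference equals the difference of the DCTs
lhs = dctn(a - b, norm='ortho')
rhs = dctn(a, norm='ortho') - dctn(b, norm='ortho')
assert np.allclose(lhs, rhs)
```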

4 Shot transition detection through hypothesis testing

In this Section, we present a shot transition detection approach based on statistical hypothesis testing. We suggest three tests that are all suitable for implementation on compressed video. Our approach is based upon the following considerations:
(i) Within the same shot, a frame may differ from its neighbouring frames due to one or more of the following factors: object movement, camera movement, focal length changes, and lighting changes. The shot transition detection process should ignore, as far as possible, such frame-to-frame changes within a shot. One way to achieve this is to perform large spatial averaging on the frame data prior to scene change detection.
(ii) To further minimise the effect of frame-to-frame changes within a shot, the detection process should be applied at a frame description level which takes into account information present in the entire frame. The intensity and colour histograms of the entire frame provide such a descriptive level, which is simple and easy to compute. Our experimental experience indicates that the intensity histogram representation of a frame is sufficient for shot change detection.
(iii) To perform accurate shot boundary detection, knowledge of the shot type is essential. A shot type refers to the gradation of distances between the camera and the recorded scene [1]. Although there are infinite gradations of distances between the camera and the recorded scene, each shot is generally considered to fit one of five basic gradations: close up, close shot, medium shot, full shot, and long shot. These gradations do not imply a fixed measurable distance in each case, but rather are defined with respect to the subject being recorded. For example, a close shot of a human subject implies an entirely different distance than a close shot of a house. Since the magnitude of frame-to-frame changes within a shot is a function of shot type (e.g. an object movement in a close shot induces more significant changes than in a medium or long shot), simply comparing the magnitude of changes is not an appropriate approach unless the comparison is normalised with respect to the shot type. Since shot type classification depends upon the distance between the camera and the subject being recorded (e.g. a person, a house, or an automobile), this kind of normalisation is impossible to achieve without a priori scene knowledge.
(iv) Often during the editing process, a longer shot is cut into smaller strips to achieve either intercutting with shorter shots or parallel blending with another longer shot [1]. Effectively, such editing steps create false transitions in shots.
(v) In view of the above, it is safe to say that false transitions in a shot change detection procedure are natural. Furthermore, false detections are relatively harmless, as the frames within such shots are still coherent in action and satisfy all the requirements of a shot definition.
(vi) Misdetection of an actual shot is a costly error that should be avoided, since a missed shot makes a set of contiguous frames incoherent. Therefore, a shot transition detection method should be evaluated primarily on missed shots and not on false detections.
(vii) Since videos in digital form occupy an enormous amount of data, it is desirable to detect scene changes in compressed video to reduce the overall processing time.

Fig. 8 shows, in flow diagram form, our shot transition detection approach. The input to the system is an MPEG encoded video. The first step in our detection scheme is the extraction of I frames from the encoded video stream, as we use only I frames for change detection. This choice is based upon the consideration that



the I frames are the only accessible frames in MPEG encoded video that can be processed in an independent manner without reference to other encoded frames.

Fig. 8 Block diagram of the shot transition detection scheme: I frame selection, histogram computation, and hypothesis testing

The second step consists of an intensity histogram computation for the extracted I frames. The histogram is computed using the first coefficient of each 8 × 8 DCT encoded block, which represents the average grey level value of the block pixels. This scheme thus has a built-in spatial averaging at the block level, which, as indicated above, is needed to suppress frame-to-frame changes within a shot.

The third and final stage of our shot transition detection system consists of comparing the current I frame histogram with the previous I frame histogram. This comparison is done by testing the following hypotheses:
H0: the two histograms are taken from the same shot (i.e. the same distribution).
H1: the two histograms are taken from different shots.

The hypothesis testing is carried out by one of three test statistics provided in our detection system. To minimise false detections due to large motion and gradual transitions, the test coefficients are observed over three frames before making the final decision. Let the cut coefficients (defined and derived in the following subsections) for frames N and N + 1 be $C_N$ and $C_{N+1}$. The cut point is then detected by evaluating the following relation:

$$\text{cut} = \begin{cases} \text{True} & \text{if } C_N \geq T_K \text{ and } C_{N+1} < T_K \\ \text{False} & \text{otherwise} \end{cases}$$

where $T_K$ is a suitable threshold value. The three statistical tests are as follows.
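In outline, this decision stage can be sketched as follows (illustrative code, not the authors' implementation; `test` stands for any of the three coefficient functions described below, applied to consecutive I frame histograms):

```python
def detect_cuts(histograms, test, threshold):
    """Three-frame decision over consecutive I frame histograms.

    histograms: list of I frame histograms in temporal order.
    test(h_prev, h_curr): one of the cut coefficient functions below.
    Frame N starts a new shot if C_N >= threshold and C_{N+1} < threshold.
    """
    coeffs = [test(histograms[k - 1], histograms[k])
              for k in range(1, len(histograms))]
    return [n + 1 for n in range(len(coeffs) - 1)
            if coeffs[n] >= threshold and coeffs[n + 1] < threshold]
```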

4.1 Yakimovsky likelihood ratio test
This test was proposed by Yakimovsky to detect the presence of an edge at the boundary of two regions [34]. It is based upon the computation of maximum likelihood estimates of the probabilities P0 and P1 that the hypotheses H0 and H1, respectively, are true. The estimated probabilities are computed under the assumption of a Gaussian distribution. The expression

for the Yakimovsky likelihood ratio is given as follows:

$$Y = \frac{(\sigma_0^2)^{m+n}}{(\sigma_1^2)^m (\sigma_2^2)^n}$$

where $\sigma_1^2$ and $\sigma_2^2$, respectively, represent the variances of the past and the current I frame histograms, and $\sigma_0^2$ the variance of the pooled data from both histograms. The numbers $m$ and $n$ denote the numbers of elements in the two histograms, respectively. Since all I frames are of the same size, we have $m = n$, and we can therefore use the following expression for the Yakimovsky likelihood ratio:

$$Y = \left(\frac{\sigma_0^2}{\sigma_1^2}\right)\left(\frac{\sigma_0^2}{\sigma_2^2}\right)$$

A shot transition is said to occur at the Nth I frame if

$$Y_N \geq T_{YK} \quad \text{and} \quad Y_{N+1} < T_{YK}$$

where $T_{YK}$ is a suitable threshold value for the test.
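A sketch of this coefficient on two binned histograms; treating the bin values themselves as the pooled data elements is our reading of the text:

```python
import numpy as np

def yakimovsky(hp, hc, eps=1e-12):
    """Yakimovsky cut coefficient for two equal-length histograms
    (m = n): Y = (var0/var1) * (var0/var2)."""
    v1, v2 = np.var(hp), np.var(hc)
    v0 = np.var(np.concatenate([hp, hc]))  # pooled data from both histograms
    return (v0 / max(v1, eps)) * (v0 / max(v2, eps))
```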

4.2 Chi-square test
The chi-square test is the most widely accepted test for comparing two binned distributions [35]. Letting $HP_j$ and $HC_j$, respectively, represent the number of grey level blocks in bin $j$ for the previous and current I frame histograms, the chi-square statistic is given as:

$$\chi^2 = \sum_j \frac{(HP_j - HC_j)^2}{HP_j + HC_j}$$

The cut coefficient is computed using the chi-square probability function $\chi^2_P(\chi^2 | \nu)$, which is an incomplete gamma function. It is defined as:

$$\chi^2_P(\chi^2 | \nu) = \frac{\gamma(\nu/2,\, \chi^2/2)}{\Gamma(\nu/2)}$$

where $\nu$ is the number of degrees of freedom and equals the number of histogram bins, $\gamma$ is the lower incomplete gamma function, and $\Gamma(\nu)$ is the gamma function, given for integer $\nu$ by:

$$\Gamma(\nu) = (\nu - 1)!$$

The function satisfies $\chi^2_P(\nu, 0) = 0$ and $\chi^2_P(\nu, \infty) = 1$. A value close to one for $\chi^2_P$ indicates that the two I frames come from different scenes, i.e. a shot boundary may exist between the two I frames. The transition point at the Nth frame is detected if

$$\chi^2_P(N) \geq T_{\chi^2} \quad \text{and} \quad \chi^2_P(N+1) < T_{\chi^2}$$

where $T_{\chi^2}$ is some suitable threshold.
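A sketch of this coefficient using SciPy's regularised incomplete gamma function (which is exactly the chi-square probability function above); skipping empty bins and counting only occupied bins towards ν is our simplification:

```python
import numpy as np
from scipy.special import gammainc  # regularised lower incomplete gamma P(a, x)

def chi_square_coeff(hp, hc):
    """Cut coefficient: chi-square statistic between two binned histograms
    mapped through the chi-square probability function (values in [0, 1])."""
    hp = np.asarray(hp, dtype=float)
    hc = np.asarray(hc, dtype=float)
    mask = (hp + hc) > 0  # skip empty bins
    chi2 = np.sum((hp[mask] - hc[mask]) ** 2 / (hp[mask] + hc[mask]))
    nu = mask.sum()  # degrees of freedom ~ number of (occupied) bins
    return gammainc(nu / 2.0, chi2 / 2.0)  # P(chi^2 | nu); near 1 at a cut
```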

4.3 Kolmogorov-Smirnov test
The Kolmogorov-Smirnov (K-S) test is another popular test for comparing two data sets [35]. It is based on calculating the cumulative distributions of two sets of data, histograms in our case. Letting $CHP_j$ and $CHC_j$ represent the cumulative histogram sums up to bin $j$ for the previous and current I frames, respectively, the K-S statistic is given as:

$$D = \max_j |CHP_j - CHC_j|$$

The K-S test is invariant to reparameterisation of the distribution axis, but it is very sensitive around the median value (i.e. where CH = 0.5) and less sensitive at the two extremes. One simple way of getting around this problem is to wrap the distribution axis and look for a statistic that is invariant under all shifts and parameterisations on the circle. One way of doing this is to compute the K-S statistic as:

$$D = D^+ + D^- = \max_j [CHP_j - CHC_j] + \max_j [CHC_j - CHP_j]$$


The K-S statistic is then mapped to a monotonic function with limiting values in [0, 1]. This is given by:

$$Q(\lambda) = 2\sum_{j=1}^{\infty} (4j^2\lambda^2 - 1)\, e^{-2j^2\lambda^2}$$

The K-S coefficient is calculated by computing the significance level of the observed value of $D$ as:

$$Q_{KS} = Q\left(\left[\sqrt{N_e} + 0.155 + \frac{0.24}{\sqrt{N_e}}\right] D\right)$$

where

$$N_e = \frac{N_1 N_2}{N_1 + N_2}$$

and $N_1$ and $N_2$ are the numbers of data points in the two distributions. A value close to one for $Q_{KS}$ indicates a shot transition between the two frames. Again, the final decision is made by observing the test coefficient over three frames. The Nth frame is declared a new shot frame if

$$Q_{KS}(N) \geq T_{KS} \quad \text{and} \quad Q_{KS}(N+1) < T_{KS}$$

where $T_{KS}$ is the test threshold value.
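A sketch of this wrapped (circular) K-S coefficient. Note that, with the series above (the Kuiper significance function from Numerical Recipes [35]), small values of Q indicate that the histograms differ; since the text states that a coefficient near one indicates a transition, the sketch returns 1 − Q, a reconciliation that is our assumption:

```python
import numpy as np

def ks_coeff(hp, hc):
    """Wrapped K-S cut coefficient between two histograms."""
    chp = np.cumsum(hp) / np.sum(hp)   # cumulative distributions
    chc = np.cumsum(hc) / np.sum(hc)
    d = np.max(chp - chc) + np.max(chc - chp)   # D = D+ + D-
    n1, n2 = np.sum(hp), np.sum(hc)
    ne = n1 * n2 / (n1 + n2)                    # effective sample size
    lam = (np.sqrt(ne) + 0.155 + 0.24 / np.sqrt(ne)) * d
    j = np.arange(1, 101)                       # the series converges quickly
    q = 2.0 * np.sum((4 * j**2 * lam**2 - 1) * np.exp(-2 * j**2 * lam**2))
    # q is a significance level (small when the histograms differ), so
    # return 1 - q as a coefficient that is near one at a transition
    return 1.0 - min(max(q, 0.0), 1.0)
```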

5 Experimental results

To evaluate the performance of the three suggested tests for shot change detection, a study was carried out using five different videos of varying characteristics: three music videos, one news video, and one movie video. The three music videos are the 'Madonna' video, the 'Michael Jackson' video and the 'Mariah Carey' video. The 'Madonna' video has fast motion and a mixture of close-ups and long shots. Being a music video, it also has many special effects. It has 597 frames and 25 cuts: 23 straight cuts and 2 optical cuts. The 'Michael Jackson' video is characterised by a large number of different types of optical cuts with very smooth transitions. Again, it is a fast paced video with a variety of shots in it. It has 1710 frames and 23 cuts: 8 straight cuts and 15 optical cuts. The 'Mariah Carey' video has a large number of short shots, a fast-moving camera and many temporal cuts. It has 5096 frames with 110 cuts: 40 straight and 70 optical cuts. The news video is a recording of 'CNN' news. It has several long shots and different types of camera motion. The 'CNN' video is 4800 frames long with 24 straight cuts and a single optical cut. The movie video is a small segment from the movie 'Indecent proposal'. The selected segment shows an indoor scene with many close-ups and a dark background, and has large camera and scene motion. The 'Indecent proposal' video has 4017 frames with a total of 46 cuts: 40 straight cuts and 6 optical cuts.

The MPEG encoding of the videos was done through a two-step process, necessitated by the absence of proper hardware. The first step consisted of capturing the videos in the QuickTime format on a Macintosh platform; these were then converted to MPEG format. Because of the video capture hardware constraints, a frame size of 160 × 120 pixels was used. At this size, each I frame has 300 blocks. The video encoding was done with an intraframe ratio of 12 frames.

Since the chi-square test requires several elements in each bin for the chi-square probability distribution to give a good approximation to the sampling distribution of the χ² statistic, 32 bins were used in the histogramming of I frames in our experiments. This ensured a sufficient number of elements in each bin. It was also observed that the chi-square test has a higher sensitivity to change than the other two tests. This made us choose a higher threshold value for the chi-square test, in our case T_χ² = 0.9, while slightly lower threshold values, T_YK = T_KS = 0.7, were chosen for the Yakimovsky and K-S tests, respectively. These values were kept fixed throughout the experiments.

The results of our experiments are presented in Table 1. Each cut entry in the Table uses a colon (:) to separate the straight and optical cuts: the left side of the colon shows straight cuts, while the right side is for gradual cuts. As these results show, all three tests are highly successful at detecting cuts. There are very few misses and some expected false alarms. The false alarms for the 'Indecent proposal' video are high due to the large number of close-up shots with large motion in this video. The false detections for the 'Mariah Carey' video are due to the large number of optical cuts, where the three-frame filtering tends to give two or more cut points. The performance of the χ² test is especially noteworthy: it is able to detect all cuts except one, and at the same time it has the lowest false alarm rate. These results reinforce an earlier study finding that the χ² test gives the best performance for scene change detection. However, while earlier studies have indicated a higher miss-detection rate for the χ² test, our results show a small miss-detection rate as well as a tolerable false alarm rate. The miss-detection rate for all three tests is lower than what has been reported in the literature thus far. We believe that these rates are due to the incorporation of large spatial averaging and the smaller number of bins in the histograms, which minimise the effect of within-shot changes.

6 Summary and future work

In this work, the problem of shot boundary detection was addressed for compressed video. It was shown that it is possible to compute the mean, the variance and region histograms directly from the compressed video. This makes it possible to implement the majority of the shot boundary detection methods proposed in the literature directly on the compressed video.

Table 1: Shot detection results with intraframe ratio 12 (each entry gives straight:optical cuts)

Video sequence     Total   True    Located shots           Missed shots         False shots
                   frames  shots   YK     CS     KS        YK    CS    KS       YK    CS     KS
Madonna              597   22:2    24:1   24:2   25:2      0:1   0:0   0:0      2:0   2:0    3:0
Michael Jackson     1710   8:15    14:14  9:15   10:9      0:1   0:0   1:6      1:0   1:0    3:0
CNN news            4800   20:1    17:1   25:1   16:1      5:0   1:0   6:0      2:0   6:0    2:0
Indecent proposal   4017   40:6    60:6   55:6   51:3      4:0   0:0   10:3     24:0  15:0   21:0
Mariah Carey        5096   40:70   51:57  57:70  49:59     5:13  0:2   5:11     16:0  17:12  14:0




Additionally, the problem of shot boundary detection was addressed using a statistical hypothesis testing approach. Three test statistics were implemented to operate on compressed video. The performance of these three tests was evaluated using five different videos of varying characteristics. The results show that shot boundary detection can be performed with high detection accuracy using the hypothesis testing approach. In particular, the performance of the χ² test was found extremely satisfactory, with a very high detection rate and a low false alarm rate.

We are currently investigating the possibility of further improvements in our results, especially lowering the false alarm rate, by performing additional tests based on the row and column histograms. We are also looking at performing shot motion classification on the basis of the block motion information already present in the P frames of an encoded video stream. It is expected that such a motion classification scheme will further aid shot boundary detection and shot characterisation.

7 Acknowledgement

The authors thank Professor Ramesh Jain of the University of California, San Diego for his interest in our work. The MPEG codec used in our experiments is from the University of California at Berkeley.

8 References

1 ARIJON, D.: 'Grammar of the film language' (Silman-James Press, Los Angeles, 1976)
2 DAVENPORT, G., SMITH, T., and PINCEVER, N.: 'Cinematic primitives for multimedia', Comp. Graph., 1991, 15, pp. 67-74
3 MACKAY, W., and DAVENPORT, G.: 'Virtual video editing in interactive multimedia applications', Commun. ACM, 1989, 32, pp. 802-810
4 SMOLIAR, S., and ZHANG, H.: 'Content-based video indexing and retrieval', IEEE Multimedia, 1994, 1, pp. 62-72
5 SALT, B.: 'Statistical style analysis of motion pictures', Film Quart., 1973, 28, pp. 13-22
6 SEYLER, A.: 'Probability distributions of television frame differences', Proc. IEEE, 1965, 53, pp. 355-366
7 COLL, D., and CHOMA, G.: 'Image activity characteristics in broadcast television', IEEE Trans., 1976, C-26, pp. 1201-1206
8 SETHI, I., SALARI, V., and VEMURI, S.: 'Image sequence segmentation using motion coherence'. Proceedings of the first international conference on Computer vision, London, England, 1987, pp. 667-671
9 OTSUJI, K., TONOMURA, Y., and OHBA, Y.: 'Video browsing using brightness data'. Proceedings of the SPIE conference on Visual communications and image processing, Boston, 1991, pp. 980-989
10 ZHANG, H., KANKANHALLI, A., and SMOLIAR, S.: 'Automatic partitioning of full-motion video', Multimedia Syst., 1993, 1, pp. 10-28
11 TONOMURA, Y.: 'Video handling based on structured information for hypermedia systems'. ACM proceedings of the international conference on Multimedia information systems, Singapore, 1991, pp. 333-344
12 NAGASAKA, A., and TANAKA, Y.: 'Automatic video indexing and full-video search for object appearances'. Proceedings of the 2nd working conference on Visual database systems, Budapest, Hungary, 1991, pp. 119-133
13 UEDA, H., MIYATAKE, T., SUMINO, S., and NAGASAKA, A.: 'Impact: an interactive natural-motion-picture dedicated multimedia authoring system'. Proceedings of CHI'91, New Orleans, 1991, pp. 343-350
14 ABE, S., TONOMURA, Y., and KASAHARA, H.: 'Scene retrieval method for video database applications using temporal condition changes'. Proceedings of the conference on Machine intelligence and vision, Tokyo, Japan, 1989, pp. 355-359
15 NAGEL, H.: 'Formation of an object concept by analysis of systematic time variation in the optically perceptible environment', Computer Graphics Image Proc., 1978, 7, pp. 149-194
16 GARGI, U., OSWALD, S., COSIBA, D., DEVADIGA, S., and KASTURI, R.: 'Evaluation of video sequence indexing and hierarchical video indexing'. Proc. SPIE - Int. Soc. Opt. Eng., San Jose, 1995, Vol. 2420, pp. 144-151
17 DEVADIGA, S., COSIBA, D., GARGI, U., OSWALD, S., and KASTURI, R.: 'Semiautomatic video database system'. Proc. SPIE - Int. Soc. Opt. Eng., San Jose, 1995, Vol. 2420, pp. 262-267
18 AIGRAIN, P., and JOLY, P.: 'The automatic real-time analysis of film editing and transition effects and its applications', Computers Graphics, 1994, 18, pp. 93-103
19 SWANBERG, D., SHU, C.F., and JAIN, R.: 'Knowledge guided parsing in video database'. Proceedings of IS&T/SPIE's symposium on Electronic imaging: science and technology, San Jose, California, 1993
20 ZHANG, H., GONG, Y., SMOLIAR, S.W., and TAN, S.: 'Automatic parsing of news video'. Proceedings of the IEEE multimedia conference, 1994, pp. 45-54
21 SHAHRARAY, B.: 'Scene change detection and content-based sampling of video sequence'. Proc. SPIE - Int. Soc. Opt. Eng., San Jose, 1995, Vol. 2419, pp. 2-13
22 ARMAN, F., HSU, A., and CHIU, M.-Y.: 'Image processing on compressed data for large video databases'. Proceedings of the first ACM international conference on Multimedia, Anaheim, CA, 1993
23 ZHANG, H., LOW, C.Y., GONG, Y., and SMOLIAR, S.: 'Video parsing using compressed data'. Proc. SPIE - Int. Soc. Opt. Eng., 1994, Vol. 2182, pp. 142-149
24 DEARDORFF, E., LITTLE, T., MARSHALL, J., VENKATESH, D., and WLAZER, R.: 'Video scene decomposition with the motion picture parser'. IS&T/SPIE symposium on Electronic imaging science and technology, San Jose, 1994
25 YEO, B.-L., and LIU, B.: 'Rapid scene change detection on compressed video', IEEE Trans., 1996, CSTV-XX, pp. 00-00
26 JUAN, Y., and CHANG, S.-F.: 'Scene change detection in a MPEG compressed video sequence'. Proc. SPIE - Int. Soc. Opt. Eng., San Jose, 1995, Vol. 2419
27 LIU, H.-C.H., and ZICK, G.L.: 'Scene decomposition of MPEG compressed video'. Proc. SPIE - Int. Soc. Opt. Eng., San Jose, 1995, Vol. 2419
28 CHERFAOUI, M., and BERTIN, C.: 'Temporal segmentation of videos: a new approach'. Proc. SPIE - Int. Soc. Opt. Eng., San Jose, 1995, Vol. 2419
29 CHANG, S.-F., CHEN, W.L., and MESSERSCHMITT, D.G.: 'Video compositing in DCT domain'. IEEE workshop on Visual signal processing and communications, Raleigh, North Carolina, 1992
30 CHANG, S.-F., and MESSERSCHMITT, D.G.: 'Manipulation and compositing of MC-DCT compressed video', IEEE J., 1995, SAC-13, pp. 1-11
31 SHEN, B., and SETHI, I.K.: 'Inner-block operations for compressed images'. Proceedings of ACM multimedia, San Francisco, 1995, pp. 489-498
32 SMITH, B., and ROWE, L.: 'Algorithms for manipulating compressed images', IEEE Comp. Graphics Appl., 1993, pp. 34-42
33 ROSENFELD, A., and KAK, A.C.: 'Digital picture processing' (Academic Press, New York, 1976)
34 YAKIMOVSKY, Y.: 'Boundary and object detection in real world images', J. Ass. Comp. Mach., 1976, 23, pp. 599-618
35 PRESS, W.H., TEUKOLSKY, S., VETTERLING, W., and FLANNERY, B.: 'Numerical recipes in C' (Cambridge University Press, 2nd edn., 1992)
