
AN INFORMATION THEORETIC MODEL OF SPATIOTEMPORAL VISUAL SALIENCY

Guoping Qiu1, Xiaodong Gu2, Zhibo Chen2, Quqing Chen2 and Charles Wang

1The University of Nottingham and Hong Kong Baptist University, 2Thomson Corporate Research, Beijing

ABSTRACT

This paper presents a principled and practical method for the computation of visual saliency of spatiotemporal events in full motion videos. Based on the assumption that uniqueness or informativeness correlates with saliency, our model predicts the saliency of a spatiotemporal event based on the information it contains. To compute the uniqueness of the spatiotemporal events, we model the joint spatial and temporal conditional probability distributions of the spatiotemporal events and compute their spatiotemporal saliencies in a natural and integrated framework. To make the information theoretic model practical, we have developed methods to simplify the model and the computational process. Testing results on several video sequences demonstrate that our model is effective in predicting visually salient spatiotemporal events and is comparable to the state of the art. It is expected that our principled and practical model will find widespread applications in multimedia content analysis and processing.

1. INTRODUCTION

Salient visual features and motions that attract human attention can be important and powerful cues for visual information analysis and processing, including content-based coding, compression, transmission/rate control, indexing, browsing, display and presentation. Whilst there has been a fair amount of previous work investigating the detection and extraction of visually salient features in still images [1-3], not much has been done to address the same problem in video sequences. Although several authors have made some initial attempts, e.g., [4, 5], most of these methods are rather ad hoc. A common approach of these previous attempts is that they first compute the spatial and temporal saliencies independently and then fuse them in some rather arbitrary manner. For spatial saliency, these methods manipulate the contrasts of various visual features (intensity, colour, texture, etc.) in some heuristic ways to generate numerical scores of saliency. For temporal saliency, these methods assume that it is somehow related to visual motion. They first detect motion using some known motion detection methods and then compute the temporal saliency scores as some heuristically chosen functions of the motion vectors.

+This work was done when the first author was visiting Thomson Corporate Research, Beijing.

Ad hoc and heuristic spatial and temporal visual saliency computational methods can have some serious weaknesses. First, motion estimation is difficult. Second, even with accurate true motion estimation, it is not clear how exactly motion relates to visual saliency, and a quantitative measure that relates motion with visual saliency is difficult to obtain. Third, assuming spatial and temporal saliencies can be correctly computed, it is not clear how they should be fused together. In most cases, previous methods gave arbitrary weightings to temporal and spatial saliencies to obtain a final score of spatiotemporal saliency.

Whilst most of the spatial visual saliency models are based on the calculation of visual contrasts, e.g., [1], Topper [6] introduced the idea of using Shannon's self-information to measure perceptual saliency in still images. Recently, Bruce [2] picked up Topper's original suggestion and studied the relationship between visual saliency and local statistics, again in still images.

In this paper, we extend the information theoretic approach to visual saliency computation to full motion video and develop a principled and integrated practical method for computing spatiotemporal saliency. The organization of the paper is as follows. In section 2, we describe the integrated spatiotemporal visual saliency model. In section 3, we present a practicable computational method for the implementation of the model. In section 4, we present experimental results, and section 5 concludes the paper.

2. AN INFORMATION THEORETIC MODEL OF SPATIOTEMPORAL VISUAL SALIENCY

We wish to develop a computational model for predicting the visual saliency of the pixels in each frame of a video. In this work, we adopt an information theoretic approach and assume that "uniqueness" or "informativeness" correlates with visual saliency. According to Shannon's information theory, an event contains high information if it is unique; on the other hand, it contains low information if the event occurs frequently. The information of an event x, I(x), is inversely proportional to the likelihood of observing x. Shannon defined this quantity as

I(x) = -log(p(x))    (1)

Clearly, according to (1), the computational task now becomes that of finding the probability distributions of the pixels. Before formally presenting our model, we first analyze what factors are likely to affect the uniqueness of the pixels and how we can go about computing the probability distributions of pixels in the context of a full motion video.
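As a quick worked illustration of (1) (our own minimal sketch, not part of the paper): an event observed with probability 0.01 carries about 6.6 bits of information, whereas an event with probability 0.5 carries only 1 bit.

```python
import numpy as np

def self_information(p):
    """Shannon self-information of an event with probability p, in bits."""
    return -np.log2(p)

print(self_information(0.01))  # ~6.64 bits: rare, hence salient under the model
print(self_information(0.5))   # 1.0 bit: common, hence not salient
```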

Instead of trying to compute the visual saliency of each individual pixel, we divide each frame into small patches and we call a patch in a particular frame a spatiotemporal event. The reason any given patch in any given frame is a spatial and temporal event is that it has a spatial as well as a temporal context, as illustrated in Figure 1. For a given spatiotemporal event, B(x, y, t), its spatial context is its current frame F(t); its temporal context is defined as the spatiotemporal events with the same spatial location in the previous N-1 frames, V(x, y, t-1) = {B(x, y, t-1), B(x, y, t-2), ..., B(x, y, t-N+1)}.


Figure 1: An illustration of the space-time relations of spatiotemporal events. A spatiotemporal event is an m x n block (patch) of pixels, B(x, y, t), located at spatial co-ordinate (x, y) in the frame at temporal (time) location t, F(t). N patches at spatial location (x, y) from frames F(t), F(t-1), ..., F(t-N+1) form a space-time 3D patch volume V(x, y, t) = {B(x, y, t), B(x, y, t-1), ..., B(x, y, t-N+1)}, which records how the patch located at (x, y) evolves over time.
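As an illustration of these definitions (a minimal sketch under our own assumptions; the array layout and helper names are hypothetical, not the authors' code), a spatiotemporal event and its space-time volume can be gathered from a grayscale video stored as a (frames, height, width) array:

```python
import numpy as np

def get_event(video, x, y, t, m=4, n=4):
    """Spatiotemporal event B(x, y, t): the m x n patch at (x, y) in frame F(t)."""
    return video[t, y:y + m, x:x + n]

def get_volume(video, x, y, t, N=2, m=4, n=4):
    """Space-time volume V(x, y, t) = {B(x, y, t), ..., B(x, y, t-N+1)}, flattened."""
    patches = [get_event(video, x, y, t - i, m, n) for i in range(N)]
    return np.concatenate([p.ravel() for p in patches])

video = np.random.rand(10, 288, 352)      # hypothetical CIF-sized clip, 10 frames
v = get_volume(video, x=100, y=50, t=5)   # a 4x4 patch with N = 2 ...
print(v.shape)                            # ... gives a 32-dimensional event vector
```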

The uniqueness of a spatiotemporal event B(x, y, t) is clearly affected by its spatial and temporal contexts. If an event is unique in the spatial context, it is likely that it is salient; similarly, if it is unique in the temporal context it is also likely to be salient. Obviously, the spatial and temporal contexts jointly influence the uniqueness of a spatiotemporal event. The task now becomes that of modeling these spatial and temporal contexts to predict the uniqueness or saliency of the spatiotemporal events.

2.1 The Model

Based on the assumption that saliency is directly related to uniqueness or informativeness, the spatiotemporal saliency score of the spatiotemporal block B(x, y, t), SSS(x, y, t), can be modeled by the amount of information contained in B(x, y, t). According to Shannon's information theory (1), the information of the spatiotemporal event B(x, y, t) given its spatial context F(t) and temporal context V(x, y, t-1) can be computed as

SSS(x, y, t) = -log(p(B(x, y, t) | V(x, y, t-1), F(t)))    (2)

Equation (2) is our new information theoretic model of spatiotemporal visual saliency, which integrates spatial and temporal information for predicting the visual saliency of spatiotemporal events in a principled, concise and elegant manner.

The model is concise and elegant; however, it is important to point out that its practical realization is extremely challenging. One of the reasons that information theory has not been widely used in the analysis of video and other very high dimensional data is that it involves some extremely difficult computational tasks. The difficulty primarily stems from the "curse of dimensionality". From (2), our task of computing the saliency is to estimate the conditional probability. The dimensionality of the variables involved in the computation is usually high, and the space in which these variables reside grows exponentially with each new dimension added. As the dimensionality increases, more and more samples are needed to estimate the probability. For example, for distributions in R^6, even 2 million samples are not sufficient to approximate the entropy correctly [7]. Furthermore, we not only have to face the "curse of dimensionality" problem, but also have to make the computational process reasonably fast in order to make the model practically useful for applications such as real time video analysis and processing. In the next section, we present a practicable computational method for the implementation of our new spatiotemporal visual saliency model.

3. COMPUTATIONAL METHODS

3.1 Model Simplification

The conditional probability of the spatiotemporal event B(x, y, t) has two conditions, the spatial condition F(t) and the temporal condition V(x, y, t-1). It is not difficult to see that it would be very difficult to estimate this probability. In order to develop practical solutions, we simplify the model by assuming that the spatial and temporal conditions are independent. Based on this assumption, we can write the joint conditional probability as

p(B(x, y, t) | V(x, y, t-1), F(t)) = p(B(x, y, t) | V(x, y, t-1)) p(B(x, y, t) | F(t))    (3)


With (3), we can now compute two conditional probabilities of the spatiotemporal event, one for the spatial condition and the other for the temporal condition. Even though (3) has simplified the model somewhat, because the variables involved are still of very high dimension, the task of estimating these two conditional probabilities still poses some serious computational difficulties.

3.2 Temporal Conditional Probability

We can write

p(B(x, y, t) | V(x, y, t-1)) = p(B(x, y, t), V(x, y, t-1)) / p(V(x, y, t-1)) = p(V(x, y, t)) / p(V(x, y, t-1))    (4)

From (4), we see that the temporal conditional probability of the spatiotemporal event B(x, y, t) can be computed by estimating the probabilities of the 3D space-time patch volumes V(x, y, t) and V(x, y, t-1). Consider a 4x4 patch and a temporal context of 2 frames (N=2); then V(x, y, t) ∈ R^32, and estimating the probability in this 32-dimensional space would be impracticable, requiring a prohibitively large amount of data and computational effort.

To get round this computational difficulty and to make our computational task tractable, we can try to transform our high dimensional spatiotemporal event vectors into a space where each dimension is independent. The independence in each dimension allows us to decompose the multidimensional probability estimation problem into that of estimating the distributions of multiple independent 1D random variables.

Computationally, this can be done by performing independent component analysis (ICA) as in [2]. However, ICA is an extremely under-constrained problem and computationally too expensive for real time video analysis. Another way to overcome the computational difficulty is to make the simplifying assumption that the distribution of the spatiotemporal events is approximately normal, in which case we can transform the events into a space where each dimension is uncorrelated. Under the normal assumption, uncorrelatedness implies independence. In the signal processing literature, there exist a number of orthogonal transforms which can be used to project high dimensional data into uncorrelated spaces. Principal Component Analysis (PCA) is one possibility. Although PCA is optimal in the mean square error sense, its transform bases are data dependent and it can be computationally expensive. Instead, we use a well-known data independent orthogonal transform, the discrete cosine transform (DCT), extensively used in image and video coding, to transform our spatiotemporal events and volumes into an uncorrelated space for the purpose of estimating the spatiotemporal event distributions.

Let φ_k, k = 1, 2, ..., K, be the K orthogonal transform bases. The procedure for computing p(V(x, y, t)) follows these steps:

Step 1: c_k(x, y, t) = φ_k V(x, y, t), ∀x, y
Step 2: Compute the probabilities p_k = p(c_k(x, y, t)), ∀k
Step 3: Compute p(V(x, y, t)) = ∏_k p_k

With this procedure, we can compute p(V(x, y, t)) and p(V(x, y, t-1)), which in turn enables us to compute the temporal conditional likelihood of the spatiotemporal event B(x, y, t) according to equation (4).
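The following sketch shows one plausible realization of Steps 1-3 and of equation (4) in the log domain. It is our own illustration under stated assumptions (orthonormal DCT via scipy, histogram-based marginals, hypothetical helper names), not the authors' implementation.

```python
import numpy as np
from scipy.fft import dctn

def volume_coeffs(volumes, shape=(2, 4, 4)):
    """Step 1: DCT coefficients c_k for each flattened space-time volume (one row each)."""
    return np.stack([dctn(v.reshape(shape), norm='ortho').ravel() for v in volumes])

def log_prob(coeffs, bins=32):
    """Steps 2-3 in the log domain: log p(V) = sum_k log p_k(c_k),
    with each marginal p_k estimated by a 1D histogram over all volumes."""
    n, K = coeffs.shape
    logp = np.zeros(n)
    for k in range(K):
        hist, edges = np.histogram(coeffs[:, k], bins=bins, density=True)
        idx = np.clip(np.digitize(coeffs[:, k], edges[1:-1]), 0, bins - 1)
        logp += np.log(hist[idx] + 1e-12)   # small epsilon guards against empty bins
    return logp

# Equation (4) in the log domain: log p(B | V(t-1)) = log p(V(t)) - log p(V(t-1)).
volumes_t  = np.random.rand(500, 32)        # hypothetical: all volumes ending at frame t
volumes_t1 = np.random.rand(500, 32)        # the corresponding volumes ending at t-1
temporal_logp = log_prob(volume_coeffs(volumes_t)) - log_prob(volume_coeffs(volumes_t1))
```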

3.3 Spatial Conditional Probability

It is important to note that the spatial context that will influence the uniqueness of a spatiotemporal event B(x, y, t) is the current frame F(t), because only the current frame and B(x, y, t) will be simultaneously visible to the viewers. The probability p(B(x, y, t) | F(t)) is therefore equivalent to p(B(x, y, t)). With this, we only need to estimate the distribution of the spatiotemporal event B(x, y, t) against all the events in F(t). We can again resort to the simplification assumption and technique of sub-section 3.2 to perform the estimation. The procedure therefore proceeds in the following steps:

Step 1: c_k(x, y, t) = φ_k B(x, y, t), ∀x, y
Step 2: Compute the probabilities p_k = p(c_k(x, y, t)), ∀k
Step 3: Compute p(B(x, y, t)) = ∏_k p_k
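The spatial term can follow the same recipe over the 2D patches of the current frame F(t) (again our own hedged sketch with hypothetical names, reusing the histogram idea from the Section 3.2 sketch):

```python
import numpy as np
from scipy.fft import dctn

def spatial_log_prob(patches, bins=32):
    """log p(B(x, y, t)) for every m x n patch of frame F(t), via 2D DCT marginals."""
    coeffs = np.stack([dctn(p, norm='ortho').ravel() for p in patches])
    logp = np.zeros(len(patches))
    for k in range(coeffs.shape[1]):
        hist, edges = np.histogram(coeffs[:, k], bins=bins, density=True)
        idx = np.clip(np.digitize(coeffs[:, k], edges[1:-1]), 0, bins - 1)
        logp += np.log(hist[idx] + 1e-12)
    return logp

patches = np.random.rand(500, 4, 4)    # hypothetical: all 4x4 patches of one frame
spatial_logp = spatial_log_prob(patches)
```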

3.4 The Spatiotemporal Saliency Score

Following (2) and (3), the spatiotemporal saliency score of the spatiotemporal event B(x, y, t) can be computed as

SSS(x, y, t) = -log(p(B(x, y, t) | V(x, y, t-1), F(t)))
             = -log(p(B(x, y, t) | V(x, y, t-1))) - log(p(B(x, y, t) | F(t)))    (5)
             = S_t(x, y, t) + S_s(x, y, t)

where S_t(x, y, t) and S_s(x, y, t) are the temporal and spatial saliency scores respectively.

Note in (5) that the spatiotemporal saliency in our model can also be written as the sum of spatial and temporal components. However, unlike previous methods, where these components are computed independently first and then fused together in some arbitrary way, in our model this decomposition is natural and is derived from the joint spatiotemporal saliency in a principled manner.
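Combining the two components per equation (5) is then a simple sum (a sketch; `temporal_logp` and `spatial_logp` are the hypothetical per-patch log-likelihoods from the sketches above):

```python
import numpy as np

def spatiotemporal_saliency(temporal_logp, spatial_logp):
    """Equation (5): SSS = S_t + S_s = -log p(B | V(t-1)) - log p(B | F(t))."""
    return -np.asarray(temporal_logp) - np.asarray(spatial_logp)

SSS = spatiotemporal_saliency(np.random.randn(500), np.random.randn(500))
```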

4. EXPERIMENTAL RESULTS

To evaluate the effectiveness of our model, we have implemented it in C++ and tested it on several video sequences. The video data format used in our experiments is CIF (352x288 pixels per frame). With a spatiotemporal event patch size of 4x4 and a temporal context of N = 2, using all DCT coefficients except the DC component for the estimation of the probability distributions, and without any code optimization, our implementation runs in real time on a Pentium PC. This shows that from a computational perspective our method is practical.
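For reference, the experimental settings reported above could be captured as follows (a hypothetical configuration of our own; the field names are not from the paper). The DC coefficient would simply be skipped when accumulating the coefficient marginals.

```python
# Settings reported in Section 4 (field names are our own):
CONFIG = {
    "frame_size": (352, 288),    # CIF resolution
    "patch_size": (4, 4),        # spatiotemporal event patch, m x n
    "temporal_context": 2,       # N frames per space-time volume
    "skip_dc": True,             # exclude the DC coefficient from probability estimation
}
```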

A quantitative evaluation of the results is difficult. Here we present some visual examples. We have also conducted experiments using data from an eye tracking system, and the results are competitive with other state of the art methods.

Figure 2 shows the saliency maps of particular frames of two sequences used in the experiments; also shown are the spatial and temporal components of the saliency.

Figure 2: Spatiotemporal, temporal and spatial saliency scores of two example sequences. Left: 118th frame of the Canoe sequence. Right: 125th frame of the Soccer sequence. Top row: the spatiotemporal saliency score (SSS). Middle: the spatial component of the SSS (S_s). Bottom: the temporal component of the SSS (S_t). Note that in both sequences, the spatial saliency scores are rather noisy while the temporal saliency scores are relatively clean.

Figure 3 shows a comparison of our model with the method in [1]. It can be seen that our method performs similarly to or better than that in [1]. More example results of our method are shown in Figure 4.

5. CONCLUDING REMARKS AND FUTURE WORK

In this work, we have developed an information theoretic spatiotemporal visual saliency model for predicting the visual saliency of spatiotemporal events in full motion video. In contrast to previous ad hoc approaches, we derive our concise and elegant model in a principled way guided by information theory. To overcome the inherent computational difficulties associated with the use of information theory, we have developed fast and practical computational solutions for the implementation of our model. Experimental results further demonstrate the effectiveness and practicability of our new method.

Our model is currently being applied to various multimedia processing tasks including video summarization and object tracking.

Figure 3: Comparison with the method in [1] (based on our own implementation of the method in [1] for video). Left: original frame. Middle: top 20 salient regions detected by the method of [1]. Right: top 20 salient regions detected by our method. Note that in the figure, for each salient patch, pixels within a radius of 69 pixels are shown.

Figure 4: More examples. From left to right: original frame image, top 20 salient regions (for each salient patch, pixels within a radius of 69 pixels are shown), another original frame, top 20 salient regions.

6. REFERENCES

[1] L. Itti, C. Koch and E. Niebur, "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.
[2] N. D. B. Bruce, "Features that draw visual attention: an information theoretic perspective," Neurocomputing, vol. 65-66, pp. 125-133, 2005.
[3] T. Kadir and M. Brady, "Scale, saliency and image description," IJCV, 45(2):83-105, November 2001.
[4] Y.-F. Ma, et al., "A User Attention Model for Video Summarization," Proceedings of ACM Multimedia, 2003.
[5] Y. Zhai and M. Shah, "Visual Attention Detection in Video Sequences Using Spatiotemporal Cues," Proceedings of ACM Multimedia, 2006.
[6] T. N. Topper, "Selection mechanisms in human and machine vision," Ph.D. Thesis, University of Waterloo, 1991.
[7] D. B. Russakoff, C. Tomasi, T. Rohlfing and C. R. Maurer, Jr., "Image Similarity Using Mutual Information of Regions," ECCV (3) 2004: 596-607.

