
Visual Attention Detection in Video Sequences Using Spatiotemporal Cues

Yun Zhai
University of Central Florida

Orlando, Florida 32816

[email protected]

Mubarak Shah
University of Central Florida

Orlando, Florida 32816

[email protected]

ABSTRACT
The human vision system actively seeks interesting regions in images to reduce the search effort in tasks such as object detection and recognition. Similarly, prominent actions in video sequences are more likely to attract our first sight than their surrounding neighbors. In this paper, we propose a spatiotemporal video attention detection technique for detecting the attended regions that correspond to both interesting objects and actions in video sequences. Both spatial and temporal saliency maps are constructed and further fused in a dynamic fashion to produce the overall spatiotemporal attention model. In the temporal attention model, motion contrast is computed based on the planar motions (homographies) between images, which are estimated by applying RANSAC to point correspondences in the scene. To compensate for the non-uniformity of the spatial distribution of interest points, the spanning areas of motion segments are incorporated in the motion contrast computation. In the spatial attention model, a fast method for computing pixel-level saliency maps has been developed using the color histograms of images. A hierarchical spatial attention representation is established to reveal the interesting points in images as well as the interesting regions. Finally, a dynamic fusion technique is applied to combine the temporal and spatial saliency maps, where the temporal attention is dominant over the spatial model when large motion contrast exists, and vice versa. The proposed spatiotemporal attention framework has been applied to over 20 testing video sequences, and attended regions are detected that highlight the interesting objects and motions present in the sequences with a very high user satisfaction rate.

Categories and Subject Descriptors
I.2.10 [Artificial Intelligence]: Vision and Scene Understanding, Perceptual Reasoning; I.4.10 [Image Processing and Computer Vision]: Image Representation.

General Terms
Algorithms.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MM'06, October 23-27, 2006, Santa Barbara, California, USA.
Copyright 2006 ACM 1-59593-447-2/06/0010 ...$5.00.

Keywords
Video Attention Detection, Spatiotemporal Saliency Map.

1. INTRODUCTION
How to achieve a meaningful video representation is an important problem in various research communities, such as multimedia processing, computer vision and content-based image and video retrieval. The effectiveness of the representation is determined by how well it fits human perception and reaction to external visual signals. Human perception tends to first pick the regions in the imagery that stimulate the vision nerves the most before continuing to interpret the rest of the scene. These attended regions could correspond to either prominent objects in the image or interesting actions in video sequences. Visual attention analysis simulates this human vision system behavior by automatically producing saliency maps of the target image or video sequence. It has a wide range of applications in tasks of image/video representation, object detection and classification, activity analysis, small-display device control and robotics control. Visual attention deals with detecting the regions of interest (ROI) in images and the interesting actions in video sequences that are most attractive to viewers. For example, in the task of object/action detection, visual attention detection significantly narrows the search range by giving a hierarchical priority structure of the target image or sequence. Consider the following scenario: a video sequence is captured by a camera that is looking at a classroom entrance. At the time the class is dismissed, the majority of the students will be going out of the classroom. In this situation, if two people are trying to walk back into the room, their actions would be considered "irregular" compared to the rest of the students. The aim of attention analysis is to quickly highlight the abnormal regions and perform further activity analysis on these regions.

1.1 Related Work
Visual attention detection in still images has long been studied, while there has been comparatively little work on spatiotemporal attention analysis. Psychological studies suggest that the human vision system perceives external features separately (Treisman and Gelade [27]) and is sensitive to the difference between the target region and its neighborhood (Duncan and Humphreys [6]). Following this suggestion, many works have focused on the detection of feature contrasts to detect attention. This is usually referred to as the "stimuli-driven" mechanism.


Itti et al. [10] proposed one of the earliest works in visual attention detection by utilizing the contrasts in color, intensity and orientation of images. Han et al. [8] detected attended objects using a Markov random field that integrates visual attention and object growing. Ma and Zhang [15] incorporated a fuzzy growing technique in the saliency model for detecting different levels of attention. Lu et al. [14] used low-level features, including color, texture and motion, as well as cognitive features, such as skin color and faces, in their attention model. Different types of images have also been exploited: Ouerhani and Hugli [19] proposed an attention model for range images using depth information. Besides the heavy investigation of the stimuli-driven approach, some methods utilize prior knowledge of what the user is looking for. Milanese et al. [16] constructed the saliency map based on both low-level feature maps and object detection outputs. Oliva et al. [18] analyzed the global distributions of low-level features to detect the potential locations of target objects. A few researchers have extended spatial attention to video sequences, where motion plays an important role. Cheng et al. [4] incorporated motion information in the attention model; their motion attention model analyzes the magnitudes of image pixel motion in the horizontal and vertical directions. Boiman and Irani [2] proposed spatiotemporal irregularity detection in videos; in their work, instead of using real motion information, textures of 2D and 3D video patches are compared with a training database to detect the abnormal actions present in the video. Le Meur et al. [12] proposed a spatiotemporal model for visual attention detection, in which affine parameters are analyzed to produce the motion saliency map.

Visual attention modelling has been applied in many fields. Baccon et al. [1] proposed an attention detection technique to select spatially relevant visual information to control the orientation of a mobile robot. Driscoll et al. [5] built a pyramidal artificial neural network to control the fixation point of a camera head by computing the 2D saliency map of the environment. Chen et al. [3] applied visual attention detection in devices with small displays, where interesting regions with high saliency values have higher priority to be displayed compared with the rest of the image. Attention models were used in image compression tasks by Ouerhani et al. [20] and Stentiford [25], where regions with higher attention values were compressed with higher reconstruction quality. Peters and O'Sullivan [22] applied visual attention in computer graphics to generate the gaze direction of virtual humans.

1.2 Proposed Framework
Video attention methods are generally classified into two categories: top-down approaches and bottom-up approaches. Methods in the first category, top-down approaches, are task-driven, where prior knowledge of the target is known before the detection process. This is based on the cognitive knowledge of the human brain, and it is a spontaneous and voluntary process. Traditional rule-based or training-based object detection methods are examples in this category. On the other hand, the second category, bottom-up approaches, are usually referred to as stimuli-driven. These are based on the human reaction to external stimuli, such as bright color, distinctive shape or unusual motion, and the reaction is a compulsory process.

In this paper, we propose a bottom-up approach for modelling spatiotemporal attention in video sequences. The proposed technique is able to detect the attended regions as well as the attended actions in video sequences. Previous temporal attention detection methods utilized dense optical flow, which is well known for its instability around motion boundaries and inside textureless areas. Furthermore, methods based on dense optical flow fields produce pixel-level saliency maps, while users are often interested in object/region-level attention. Different from the previous methods, our proposed temporal attention model utilizes interest-point correspondences and the geometric transformations between images. In this model, feature points are first detected in consecutive video images, and correspondences are established between the interest points using the Scale Invariant Feature Transform (SIFT [13]). The advantage of using point correspondences (or sparse optical flow) is that they are much more stable and accurate than dense optical flow. The RANSAC algorithm is then applied to the point correspondences to find the moving planes in the sequence by estimating their homographies. Projection errors of the interest points under the estimated homographies are incorporated in the motion contrast computation.

In our proposed spatial attention model, we have constructed a hierarchical visual saliency representation. A linear-time algorithm is developed to compute the pixel-level saliency maps. In this algorithm, the color statistics of the images (color histograms) and a pre-computed color difference table are used to reveal the color contrast information in the scene. Given the pixel-level saliency map, attended points are detected by finding the image pixels with locally maximal saliency values. The region-level spatial attention is constructed based upon the detected attended points. Given an attended point, a unit region is created with the attended point as its center. This region is then iteratively expanded by computing the expansion potentials along the sides of that region. Rectangular attended regions are obtained when the stopping criterion is met. The detected attended regions often correspond to the target objects in the scene. The temporal and spatial attention models are finally combined by a dynamic fusion technique. Higher weights are assigned to the temporal model if large motion contrast is present in the sequence; otherwise, higher weights are assigned to the spatial model. This is because the human vision system is more sensitive to moving objects than to other cues.

One major advantage of our proposed method compared to previous methods is that we are able to achieve region/object-level attention in addition to the pixel-level saliency computation. This will be very useful in future tasks such as object/activity detection and classification, which require region-level processing. The work flow of the proposed attention detection framework is shown in Figure 1. To demonstrate the effectiveness of our spatiotemporal attention framework, we have applied it to over 20 testing videos, which include both sequences with moving objects and ones with uniform global motions. Very satisfactory results have been obtained. The remainder of this paper is organized as follows: the temporal and spatial attention models are presented in Section 2 and Section 3, respectively, with the corresponding intermediate results. Section 4 describes the proposed dynamic fusion method to combine the two individual attention models. Section 5 presents the experimental results and the performance evaluations. Finally, Section 6 concludes our work.


[Figure 1 diagram: Input Video Sequence → Interest-Point Correspondences → Temporal Saliency Detection Using Homography (Temporal Attention Model); Input Video Sequence → Pixel-Level Saliency Map Computation → Hierarchical Attention Representation (Spatial Attention Model); both models → Dynamic Model Fusion → Spatiotemporal Saliency Map.]

Figure 1: Work flow of the proposed spatiotemporal attention detection framework. It consists of two components, the temporal attention model and the spatial attention model. These two models are combined using a dynamic fusion technique to produce the overall spatiotemporal saliency maps.

2. TEMPORAL ATTENTION MODEL
In temporal attention detection, saliency maps are often constructed by computing the motion contrast using the dense optical flow of the pixels; the limitations of dense flow around motion boundaries and in textureless regions have been described in Section 1.2. In contrast, point correspondences (also known as sparse optical flow) between images are comparatively accurate and stable. In this section, we propose a novel approach for computing the temporal saliency map using the point correspondences and the geometric transformations between video images.

Given the images in a video sequence, feature points are localized in each image using a point detection method, and correspondences between matching points in consecutive frames are established by analyzing the properties of the image regions around the points. In our framework, we apply the Scale Invariant Feature Transform (SIFT [13]) operator to find the interest points and compute the correspondences between points in video frames. Let p_m = (x_m, y_m) be the m-th point in the first image and p′_m = (x′_m, y′_m) be its correspondence in the second image. One example of the interest-point matching is shown in Figure 2. Given the point correspondences, the temporal saliency value Sal_T(p_i) of point p_i is computed by modelling the motion contrast between the target point and the other points,

Sal_T(p_i) = ∑_{j=1}^{n} Dist_T(p_i, p_j),    (1)

where n is the total number of correspondences and Dist_T(p_i, p_j) is a distance function between p_i and p_j. In our formulation, we analyze the geometric transformations between images. The motion model used is the homography, which models planar transformations. The interest point p = [x, y, 1]^T and its correspondence p′ = [x′, y′, 1]^T are associated by

[x′; y′; w′] = [a_1 a_2 a_3; a_4 a_5 a_6; a_7 a_8 1] [x; y; 1].    (2)

Here, [x′, y′, w′]^T is the projection of p in the form of homogeneous coordinates. The parameters {a_i, i = 1, · · · , 8} capture the transformation between the two matching planes, and they can be estimated from at least four pairs of correspondences. For simplicity, we use H to represent the transformation matrix in the rest of the text, and we normalize the projection such that its third element is 1. Ideally, the projection of p should be identical to its correspondence p′. With noise present in the imagery, the projection matches p′ with an error, computed after applying H, as

ε(p_i, H) = ‖ Hp_i − p′_i ‖,    (3)

where Hp_i denotes the normalized projection of p_i.

Motions of objects are only meaningful when a certain reference is defined. For instance, a car is said to be "moving" only if a visible background is present in the scene and disagrees with the car in terms of motion direction. In other words, local motion is revealed by the presence of multiple motion patterns in the scene. In such situations, a single homography is insufficient to model all the correspondences in the imagery. To overcome this problem, we apply the RANSAC algorithm to the point correspondences to estimate multiple homographies that model the different motion segments in the scene. The homographies are later used in the temporal saliency computation.
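
As a concrete illustration of this front end, the sketch below pairs SIFT matching with repeated RANSAC fits that peel off one homography (motion segment) at a time, using OpenCV. It is only a plausible reading of the steps described above: the function names, the 0.75 ratio test, the 3-pixel reprojection threshold and the `max_models` cap are illustrative choices, not values from the paper, which runs its own RANSAC loop (100 iterations per round, see Section 5).

```python
import cv2
import numpy as np

def estimate_homographies(img1, img2, min_inliers=8, max_models=5):
    """Sketch: SIFT correspondences between two frames, then repeated
    RANSAC fits that extract one homography (motion segment) at a time.
    Parameter values here are illustrative, not the paper's."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Ratio-test matching of SIFT descriptors.
    matcher = cv2.BFMatcher()
    matches = [m for m, n in matcher.knnMatch(des1, des2, k=2)
               if m.distance < 0.75 * n.distance]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    models = []                        # (H, inliers in frame 1, inliers in frame 2)
    remaining = np.ones(len(pts1), dtype=bool)
    for _ in range(max_models):
        idx = np.flatnonzero(remaining)
        if len(idx) < min_inliers:
            break
        H, mask = cv2.findHomography(pts1[idx], pts2[idx], cv2.RANSAC, 3.0)
        if H is None:
            break
        inl = idx[mask.ravel().astype(bool)]
        if len(inl) < min_inliers:
            break
        models.append((H, pts1[inl], pts2[inl]))
        remaining[inl] = False         # fit the next homography on the rest
    return models

def projection_error(H, p, p_prime):
    """Eqn. (3): distance between the projection of p under H and its
    observed correspondence p'."""
    q = H @ np.array([p[0], p[1], 1.0])
    q = q[:2] / q[2]                   # normalize homogeneous coordinates
    return float(np.linalg.norm(q - np.asarray(p_prime)))
```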

For each homography H_m estimated by RANSAC, a list of points L_m = {p_1^m, · · · , p_{n_m}^m} is considered as its inliers, where n_m is the number of inliers of H_m. Given the homographies and the projection error definition in Eqn. 3, we can define the motion contrast function in Eqn. 1 as

Dist_T(p_i, p_j) = ε(p_i, H_m),    (4)

where p_j ∈ L_m. The temporal saliency value of each point p is then computed as

Sal_T(p) = ∑_{j=1}^{M} f_j × ε(p, H_j),    (5)

where f_j is the size of the inlier set of H_j. From Eqn. 5, it is apparent that the sizes of the inlier sets play a dominant role in the temporal saliency computation. However, it is well known that the spatial distribution of interest points is not uniform, due to the varying texture content of different image parts. Sometimes relatively large moving objects/regions contribute fewer trajectories, while smaller regions with richer texture provide more. One example is shown in Figure 2. In these cases, the current temporal saliency definition is not realistic: larger regions with fewer points, which often belong to the background, will be assigned higher attention values, while foreground objects, which are supposed to be the true attended regions, will be assigned lower attention values if they possess more interest points. To avoid this problem, we incorporate the spanning area information of the moving regions.

Figure 2: One example of the point matching and motion segmentation results. Figures (a) and (b) show two consecutive images, with the interest points in both images and their correspondences. The motion regions are shown in figure (c).

The spanning area of a homography H_m is computed as

α_m = (max(x_i^m) − min(x_i^m)) × (max(y_i^m) − min(y_i^m)),    (6)

where p_i^m ∈ L_m, and α_m is normalized with respect to the image size, such that α_m ∈ [0, 1]. In the extreme cases where max(x_i^m) = min(x_i^m) or max(y_i^m) = min(y_i^m), to avoid zero values of α_m, the corresponding term in Eqn. 6 is replaced with a non-zero constant (in the experiments, we use 0.1). The temporal saliency value of a target point p is finally computed as

Sal_T(p) = ∑_{j=1}^{M} α_j × ε(p, H_j),    (7)

where M is the total number of homographies in the scene. In the degenerate cases where a point correspondence does not belong to the inlier set of any of the estimated homographies, we apply a simplified form of the homography to each such correspondence. Suppose {p_t, p′_t} is one of the "left-out" correspondences. The transformation is defined as a translation matrix H_t = [1 0 d_x^t; 0 1 d_y^t; 0 0 1], where d_x^t = x′_t − x_t and d_y^t = y′_t − y_t, and the inlier set is L_t = {p_t}.
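
The per-point temporal saliency of Eqns. 6-7 can then be sketched as follows, reusing the `models` list from the hypothetical `estimate_homographies` above. The 0.1 floor mirrors the non-zero constant mentioned for degenerate spanning areas; everything else is an illustrative reading rather than the authors' implementation.

```python
import numpy as np

def spanning_area(inlier_pts, img_w, img_h, min_extent=0.1):
    """Eqn. (6): normalized bounding-box extent of a homography's inliers.
    min_extent stands in for the non-zero constant (0.1) used when an
    extent degenerates to zero."""
    xs, ys = inlier_pts[:, 0], inlier_pts[:, 1]
    dx = max((xs.max() - xs.min()) / img_w, min_extent)
    dy = max((ys.max() - ys.min()) / img_h, min_extent)
    return dx * dy

def temporal_saliency(p, p_prime, models, img_w, img_h):
    """Eqn. (7): saliency of a correspondence (p, p') is the spanning-area
    weighted sum of its projection errors under all homographies."""
    p_h = np.array([p[0], p[1], 1.0])
    sal = 0.0
    for H, inliers1, _ in models:             # models from estimate_homographies()
        q = H @ p_h
        q = q[:2] / q[2]                      # projection of p, cf. Eqn. (3)
        err = np.linalg.norm(q - np.asarray(p_prime))
        sal += spanning_area(inliers1, img_w, img_h) * err
    return sal
```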

Up to this point, we have the saliency values of individual points and the spanning regions of the homographies, which correspond to the potential moving objects in the scene. To achieve object-level attention for H_m, the average of the saliency values of its inliers L_m is taken as the saliency value of the corresponding spanning region, and all image pixels in the same spanning region share this saliency value. Since the resulting regions are rectangular, it is likely that an image pixel is covered by multiple spanning regions; in this case, the pixel is assigned the highest possible saliency value. If a pixel is not covered by any spanning region, its saliency value is set to zero. One example of the proposed temporal saliency map computation is demonstrated in Figure 3, where the camera follows a moving toy train from right to left. The motions in the scene include the dominant global panning to the left with slight zooming-out, and the local motions caused by moving objects such as the calendar, ball and toy train. Clearly, the attention region in the sequence corresponds to the moving toy train.

Figure 3: Example of the proposed temporal attention model. Figures (a) and (b) show two consecutive images of the input sequence. Figure (c) shows the interest-point correspondences between (a) and (b). Figure (d) shows the detected temporal saliency map using the proposed homography-based method; rectangular attended regions are obtained from (d). In this example, the camera follows the moving toy train from right to left, accompanied by a small zooming-out motion, while the calendar is also moving downward. Thus, intuitively, the attention region should correspond to the toy train. The saliency map also suggests that the second attended region corresponds to the moving calendar. Brighter color represents higher saliency value.

3. SPATIAL ATTENTION MODEL
When viewers watch a video sequence, they are attracted not only by the interesting activities, but also sometimes by the interesting objects present in the still images. This is referred to as spatial attention. Psychological studies show that the human perception system is sensitive to the contrast of visual signals, such as color, intensity and texture. Taking this as the underlying assumption, we propose an efficient method for computing the spatial saliency maps using the color statistics of images. Utilizing the image color histograms, the proposed algorithm has linear computational complexity with respect to the number of image pixels. The saliency map of an image is built upon the color contrast between image pixels. The saliency value of a pixel I_k in an image I is defined as

Sal_S(I_k) = ∑_{∀ I_i ∈ I} ‖ I_k − I_i ‖,    (8)

where the value of I_i is in the range [0, 255], and ‖ · ‖ represents a distance metric between color values. This equation expands to the following form:

Sal_S(I_k) = ‖I_k − I_1‖ + ‖I_k − I_2‖ + · · · + ‖I_k − I_N‖,    (9)

where N is the total number of pixels in the image. Given an input image, the color value of each pixel I_i is known. Let I_k = a_m; Eqn. 9 can then be restructured such that the terms with the same color value are grouped together,

Sal_S(I_k) = ‖a_m − a_0‖ + · · · + ‖a_m − a_1‖ + · · · ,
Sal_S(a_m) = ∑_{n=0}^{255} f_n ‖a_m − a_n‖,    (10)

where f_n is the frequency of pixel value a_n in the image. The frequencies are expressed in the form of a histogram, which can be computed in O(N) time. Since a_n ∈ [0, 255], the color distance ‖a_m − a_n‖ is also bounded in the range [0, 255]. Since this is a fixed range, a distance map D can be constructed in constant time prior to the saliency map computation. In this map, element D(x, y) = ‖a_x − a_y‖ is the color difference between a_x and a_y. One color difference map is shown in Figure 4.

Figure 4: The distance map between the gray-level color values, which can be computed prior to the pixel-level saliency map computation. Brighter elements represent larger distance values.

Given the histogram f(·) and the color distance map D(·, ·), the saliency value of a pixel I_k is computed as

Sal_S(I_k) = Sal_S(a_m) = ∑_{n=0}^{255} f_n D(m, n),    (11)

which executes in constant time.

Thus, instead of computing the saliency values of all the image pixels using Eqn. 8, only the saliency values of the colors {a_i, i = 0, · · · , 255} are necessary for the generation of the final saliency map. One example of the pixel-level spatial saliency computation is shown in Figure 5.

Figure 5: An example of the spatial saliency computation. The left figure shows the input image. The center-top figure shows the histogram of the R-channel of the image, while the center-bottom figure shows the saliency values of the colors. The horizontal axis represents the values of the colors, a_n ∈ [0, 255]. The saliency values are close to what a human expects, since higher frequency indicates repeated information in the image, which is therefore relatively unattractive. The right figure shows the resulting spatial saliency map.

Figure 6: An example of the attended region expansion using the pixel-level saliency map. A seed region is created on the left. Expansion potentials on all four sides of the attended region are computed (shaded regions). The lengths of the arrows represent the strengths of the expansions on the sides. The final attended region is shown on the right.
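
A minimal sketch of this linear-time computation, for a single 8-bit channel and assuming Eqns. 8-11 with the absolute gray-level difference as the distance metric, might look as follows (color images could be handled per channel or on a quantized joint histogram):

```python
import numpy as np

def pixel_saliency_map(gray):
    """Linear-time spatial saliency of Section 3 (Eqns. 8-11) for one
    8-bit channel: histogram the image, precompute the 256x256 color
    distance table, then look up one saliency value per color."""
    gray = np.asarray(gray, dtype=np.uint8)
    hist = np.bincount(gray.ravel(), minlength=256)       # f_n, Eqn. (10)

    levels = np.arange(256)
    dist = np.abs(levels[:, None] - levels[None, :])      # D(x, y) = |a_x - a_y|, Fig. 4

    color_saliency = dist @ hist                          # Sal_S(a_m), Eqn. (11)
    return color_saliency[gray].astype(np.float64)        # per-pixel lookup
```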

Inspired by the work presented in [15], we propose a hierarchical representation for the spatial attention model based on the pixel-level saliency map computed previously. Two levels of attention are obtained: attended points and attended regions. Attended points are analogous to the direct response of the human perception system to external signals; they are computed as the image pixels with locally maximal spatial saliency values. The region-level attention representation, on the other hand, provides the attended objects in the scene. One simple way to obtain the attended regions is to apply a connected-component algorithm to find the bright regions. However, as shown in Figure 5, pixels with low attention values are embedded in high-value regions, and a connected-component algorithm will fail to include these pixels in the attended region. Furthermore, the connected-component method tends to over-detect the attended regions. In this paper, we present a region growing technique for detecting the attended regions which is able to resolve the above-mentioned problems. In our formulation, the attended regions are first initialized based on the attended points computed previously. Given an attended point c, a rectangular box centered at c with unit dimensions is created as the seed region B_c. The seed region is then iteratively expanded by moving its sides outward based on an analysis of the energy around its sides. The attended region expansion algorithm is described as follows:

1. For each side i ∈ {1, 2, 3, 4} of region B with length l_i, two energy terms E(s_i) and E(s′_i) are computed for both its inner and outer sides s_i and s′_i, respectively, as shown in Figure 6. The potential for expanding side i outward is defined as

   EP(i) = E(s_i) E(s′_i) / l_i²,    (12)

   where l_i² is for the purpose of normalization.

2. Expand the region by moving side i outward by a unit length if EP(i) > Th, where Th is the stopping criterion for the expansion. In the experiments, the unit length is 1 pixel.

3. Repeat steps 1 and 2 until no side of B can be further expanded, i.e., until all the corresponding expansion potentials are below the defined threshold.

It should be noted that the expansion potential defined in Eqn. 12 is designed such that the attended region is expanded if and only if both the inner and outer sides have high attention values. The expansion stops at the boundary between the high-value regions representing the interesting objects and the low-value regions of the background. A demonstration of the expansion process is shown in Figure 6. It is possible that attended regions initiated from different attended points eventually cover the same image region. In this case, a region merging technique is applied to merge the attended regions that cover the same target image region by analyzing the overlap ratio between the regions. To be consistent with the temporal attention model, the final spatial saliency map presents the attended regions as rectangles. Detailed results of the spatial attention detection on two images are shown in Figure 7.

Figure 7: The results of spatial attention detection on two testing images. Column (a) shows the input images; column (b) shows the pixel-level spatial saliency maps; column (c) presents the detected attention points; column (d) shows the expanding boxes from the attention points in (c); finally, column (e) shows the region-level saliency maps of the images.
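
The two levels of the hierarchy can be sketched together: attended points as local maxima of the pixel-level map, and the box-growing loop above around each of them. The maximum-filter neighborhood, the `thresh` value and the iteration cap are illustrative assumptions; the paper only fixes the unit step at 1 pixel.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def attended_points(sal, size=15, top_k=5):
    """Attended points: pixels whose saliency is a local maximum in a
    size x size neighborhood (neighborhood size and top_k are assumptions)."""
    peaks = (sal == maximum_filter(sal, size=size)) & (sal > sal.mean())
    ys, xs = np.nonzero(peaks)
    order = np.argsort(sal[ys, xs])[::-1][:top_k]
    return list(zip(ys[order], xs[order]))

def expand_attended_region(sal, seed, thresh, max_iter=500):
    """Region growing of Section 3: push out any side of the box whose
    expansion potential EP (Eqn. 12) exceeds the threshold."""
    h, w = sal.shape
    top, bottom, left, right = seed[0], seed[0], seed[1], seed[1]

    def strip(y0, y1, x0, x1):
        # saliency energy over a one-pixel-thick strip, clipped to the image
        return float(sal[max(y0, 0):min(y1, h - 1) + 1,
                         max(x0, 0):min(x1, w - 1) + 1].sum())

    for _ in range(max_iter):
        width, height = right - left + 1, bottom - top + 1
        grow = []
        sides = [  # (name, inner strip, outer strip, side length, can expand)
            ("top",    strip(top, top, left, right),
                       strip(top - 1, top - 1, left, right),       width,  top > 0),
            ("bottom", strip(bottom, bottom, left, right),
                       strip(bottom + 1, bottom + 1, left, right), width,  bottom < h - 1),
            ("left",   strip(top, bottom, left, left),
                       strip(top, bottom, left - 1, left - 1),     height, left > 0),
            ("right",  strip(top, bottom, right, right),
                       strip(top, bottom, right + 1, right + 1),   height, right < w - 1),
        ]
        for name, e_in, e_out, length, ok in sides:
            if ok and e_in * e_out / length ** 2 > thresh:     # Eqn. (12)
                grow.append(name)
        if not grow:
            break
        top -= "top" in grow
        bottom += "bottom" in grow
        left -= "left" in grow
        right += "right" in grow
    return top, bottom, left, right
```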

4. DYNAMIC MODEL FUSION
In the previous sections, we presented the temporal and spatial attention models separately. These two models need to collaborate in a meaningful way to produce the final spatiotemporal video saliency maps. Psychological studies reveal that the human vision system is more sensitive to motion contrast than to other external signals. Consider a video sequence in which the camera is following a walking person, while the background is moving in the direction opposite to the camera's movement. In general, humans are more interested in the followed target, the walking person, than in the surrounding regions, the background. In this example, motion is the prominent cue for attention detection compared to other cues, such as color, texture and intensity. On the other hand, if the camera is static or only scanning the scene, so that motion is relatively uniform, then the human perception system is attracted more by the contrasts caused by other visual stimuli, such as color and shape. Thus, we propose the following criteria for the fusion of the temporal and spatial attention models:

1. If strong motion contrast is present in the sequence, the temporal attention model should be dominant over the spatial attention model.

2. On the other hand, if the motion contrast is low in the sequence, the fused spatiotemporal attention model should incorporate the spatial attention model more.

Based on these two criteria, a simple linear combination with fixed weights between the two individual models is not realistic and would produce unsatisfactory spatiotemporal saliency maps. Rather, we propose a dynamic fusion technique which satisfies the aforementioned criteria. It gives a higher weight to the temporal attention model if high contrast is present in the temporal saliency map. Similarly, it gives a higher weight to the spatial model if the motion contrast is relatively low.

Figure 9: An example of model fusion. This sequence shows two people sitting and another person walking in front of them. (a) is the key-frame of the input sequence. (c) shows the temporal saliency map based on the planar transformations. (d) shows the region-level spatial saliency map using color contrast. (e) is the combined spatiotemporal saliency map. Obviously, the moving object (the walking person) catches more attention than the still regions (the sitting persons); thus, it is assigned higher attention values. The region that corresponds to the interesting action in the scene is shown in (b).

Finally, the spatiotemporal saliency map of an image I in the video sequence is constructed as

Sal(I) = κ_T × Sal_T(I) + κ_S × Sal_S(I),    (13)

where κ_T and κ_S are the dynamic weights for the temporal and spatial attention models, respectively. These dynamic weights are determined in terms of the variance of Sal_T(I). One special situation needs to be considered carefully: in a scene with a moving object whose size is relatively small compared to the background, the variance of the temporal saliency map would be kept low by the overwhelming background saliency values and would not truly reflect the existence of the moving object. We therefore compute a variance-like measure, the pseudo-variance, defined as PVar_T = max(Sal_T(I)) − median(Sal_T(I)). The weights κ_T and κ_S are then defined as

κ_T = PVar_T / (PVar_T + Const),    κ_S = Const / (PVar_T + Const),    (14)

where Const is a constant. From Eqn. 14, if the motion contrast in the temporal model is high, the value of PVar_T increases; consequently, the fusion weight of the temporal model, κ_T, is increased, while the fusion weight of the spatial model, κ_S, is decreased. The plots of κ_T and κ_S with respect to PVar_T are shown in Fig. 8. One example of the spatiotemporal attention detection is shown in Fig. 9, in which a person walks in front of two sitting people. The moving object (the walking person) is highlighted by the detected attention region.

Figure 8: Plots of the dynamic weights, κ_T and κ_S, with respect to the pseudo-variance PVar_T of the temporal saliency map, where Const = 0.3. As is clear from the figure, the fusion weight of the temporal attention model increases with the pseudo-variance of the temporal saliency values.
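
A compact sketch of the fusion step, assuming the two maps have already been normalized to a comparable range and using the Const = 0.3 value shown in Figure 8:

```python
import numpy as np

def fuse_saliency(sal_t, sal_s, const=0.3):
    """Eqns. (13)-(14): the pseudo-variance of the temporal map sets the
    dynamic weights; large motion contrast shifts weight to the temporal
    model, little motion shifts it to the spatial model."""
    pvar_t = float(np.max(sal_t) - np.median(sal_t))    # PVar_T
    k_t = pvar_t / (pvar_t + const)
    k_s = const / (pvar_t + const)
    return k_t * sal_t + k_s * sal_s
```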

Figure 10: Spatiotemporal attention detection results for the testing videos in Testing Set 1. Column (a) shows the representative frames of the videos; column (b) shows the temporal saliency maps; column (c) shows the spatial saliency maps; column (d) shows the fused spatiotemporal saliency maps; and column (e) shows the regions that correspond to potential interesting actions in the clips. It should be noted that when rich texture exists in the scene, the temporal attention model is able to detect the attended regions using motion information, while the spatial model fails.

5. PERFORMANCE EVALUATION
To demonstrate the effectiveness of the proposed spatiotemporal attention model, we have applied the method to two types of video sequences, labelled Testing Set 1 and Testing Set 2. The testing sequences are obtained from feature films and television programs (TRECVID 2005 BBC rushes data [26]). Testing Set 1 contains nine video sequences, each of which has one object moving in the scene, such as a moving car or a flying airplane. The detailed results for Testing Set 1 are shown in Figure 10. The following information is presented: the representative frames of the testing videos (Figure 10(a)), the temporal saliency maps of the representative frames (Figure 10(b)), the spatial saliency maps of the representative frames (Figure 10(c)), the final spatiotemporal saliency maps (Figure 10(d)) and the detected regions that correspond to the prominent actions in the videos (Figure 10(e)). It should be noted that for those videos with very rich texture (videos 2, 3, 4 and 5), the spatial attention model generates meaningless saliency maps. However, with the help of the proposed dynamic model fusion technique, the temporal attention model becomes dominant and is able to detect the regions where interesting actions happen (as shown in Figure 10(e)). This exactly fits the human perceptual reaction to motion contrast in these types of situations, regardless of the visual texture in the scene.

Figure 11: Spatiotemporal attention detection results for the testing videos in Testing Set 2. Only the spatial saliency maps are shown, since there is no motion contrast in the scene. Column (a) shows the representative frames of the videos; column (b) shows the pixel-level spatial saliency maps; column (c) shows the extended bounding boxes using the expansion method proposed in Section 3; column (d) shows the detected attended points in the representative frames; and column (e) shows the detected attended regions in the images. Corresponding evaluation results are shown in Figure 12. Note that column (e) shows different information from column (c): if the extended bounding boxes overlap (second example in the first row), they are merged to produce a single attended region in the scene. Also, small bounding boxes are removed in the attended region generation.

The second testing set, Testing Set 2, contains video sequences without prominent motions. These videos mainly focus on static scene settings or have uniform global motions, i.e., there is no motion contrast in the scene. In this case, the spatial attention model should be dominant over the temporal model. Some results on the testing sequences in Testing Set 2 are shown in Figure 11. The results include the following: the representative key-frames of the videos (Figure 11(a)), the pixel-level spatial saliency maps (Figure 11(b)), the expanding regions (Figure 11(c)), the attended points in the representative frames (Figure 11(d)) and the attended regions in the representative frames (Figure 11(e)). Temporal saliency maps are not shown since they are uniform and carry little information.

Assessing the effectiveness of a visual attention detection method is a very subjective task. Therefore, manual evaluation by humans is an important and inevitable element of the performance analysis. In our experiments, we invited five assessors with both computer science and non-computer-science backgrounds to evaluate the performance of the proposed spatiotemporal attention detection framework. Similar to the evaluation strategy in [15], each assessor was asked to vote on how satisfactory he or she found the detected attended region for each testing sequence. There are three levels of satisfaction: good, acceptable and failed. Good represents the situations where the detected attended regions/actions exactly match what the assessor expects. As pointed out by [15], it is somewhat difficult to define the acceptable cases, because different assessors have different views even of the same video sequence: an attended region considered inappropriate by one assessor may be considered perfect by another. In our experimental setup, if the detected attended regions in a video sequence do not cover the most attractive regions, but instead cover less interesting regions, the results are considered acceptable. By this definition, being acceptable is subjective to individual assessors. For instance, in the last example in Figure 10, one assessor may consider the walking person more interesting than the two sitting people; the current result is then considered good by that assessor. Another assessor, however, may be more attracted by the sitting people than by the walking person; in this case, the result is considered acceptable by the second assessor.

We have performed the evaluation on both testing sets with three categories: (1) Testing Set 1 with moving objects in the scene; (2) Testing Set 2 with detected attended points; and (3) Testing Set 2 with detected attended regions in the scene. Figure 12 shows the assessment of all three categories. In this table, the element in row M and column N represents the proportion of the votes on category M with satisfaction level N. The assessment shown in the table demonstrates that the proposed spatiotemporal attention detection framework is able to discover the interesting objects and actions with more than 90% satisfaction rates. The results of attended point detection have a lower satisfaction rate than the other two region-level attention representations. The reason is that attended regions possess contextual information among image pixels and therefore carry richer semantic content than isolated image pixels. Attended points, on the other hand, are isolated from each other, and the human perception system responds to them very differently from person to person. Due to this lack of semantic meaning, more disagreements between assessors emerged for the detected attended points, which lowered the satisfaction score.

Data Set                            Good    Acceptable    Failed
Testing Set 1 (Moving Objects)      0.82    0.16          0.02
Testing Set 2 (Attended Points)     0.70    0.12          0.18
Testing Set 2 (Attended Regions)    0.80    0.12          0.08

Figure 12: System performance evaluation for three categories: Testing Set 1 with moving objects, Testing Set 2 with attended point detection, and Testing Set 2 with attended region detection.

Another interesting observation from the experiments is that, as the texture content of the imagery becomes richer, the attention detection performs with a lower satisfaction rate. This is clearly shown in the results (Figure 11). For videos that have prominent objects against relatively plain background settings, the proposed attention detection method performs well and produces very satisfactory attended regions. On the other hand, if the background settings are much richer, some false detections are generated. This is actually similar to the human vision system. As pointed out by the psychological studies in [6], the human vision system is sensitive to the difference, or contrast, between the target region and its neighborhood. In situations where the background settings are relatively uniform, the contrast between the object and the background is large, and the human vision system is able to pick out the target region very easily. If rich background settings are present in the scene, the contrast between the object and the background is smaller than in the former cases; the human vision system is then distracted by other regions in the scene and is less capable of finding the target object.

We have also carried out a comparison experiment between the conventional optical-flow based approach and our proposed framework. The optical flow field is computed based on image blocks of 16×16 pixels. There are two general problems with the optical-flow based approach. First, it is sensitive to the noise around the moving regions in the scene: pixels with noisy flow vectors have large motion contrast compared to their neighbors, and therefore suppress the saliency value of the true moving object. Second, it produces only pixel-level attention; if the user wants to apply the attention results to tasks like object detection and classification, further processing is needed to produce region-level attention. Figure 13 shows two comparison examples.
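
For reference, the baseline could be sketched roughly as below; Farneback dense flow averaged over 16×16 blocks stands in for the paper's SSD block matching (which is not specified beyond the block size), and the saliency of a block is simply its flow contrast against all other blocks:

```python
import cv2
import numpy as np

def flow_contrast_saliency(img1, img2, block=16):
    """Sketch of the optical-flow baseline of Section 5 for two grayscale
    uint8 frames: block-averaged flow, saliency as pairwise flow contrast.
    Farneback flow is a stand-in for the paper's SSD block matching."""
    flow = cv2.calcOpticalFlowFarneback(img1, img2, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = img1.shape[:2]
    bh, bw = h // block, w // block
    # average the dense flow inside each 16x16 block
    blocks = flow[:bh * block, :bw * block].reshape(bh, block, bw, block, 2).mean(axis=(1, 3))
    # contrast of each block's flow against all other blocks (cf. Eqn. 1)
    diffs = np.linalg.norm(blocks.reshape(-1, 1, 2) - blocks.reshape(1, -1, 2), axis=2)
    sal = diffs.sum(axis=1).reshape(bh, bw)
    return sal / (sal.max() + 1e-12)
```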

As described in Section 3, our proposed spatial attention detection method runs in linear time with respect to the number of image pixels. The main computational cost of our temporal saliency model lies in the RANSAC-based homography estimation. In RANSAC, homographies are estimated in an iterative fashion, where the inlier set in each round is enlarged if a better homography emerges. In our experiments, there are 100 iterations in each round of homography estimation. Using the traditional four-point algorithm for homography estimation, the running time is acceptable for the attention detection task. For instance, it generally takes two seconds or less to compute the temporal saliency map between a pair of images using MATLAB on a PC with a 1.7 GHz processor.

6. CONCLUSIONS AND FUTURE DIRECTIONS

In this paper, we have presented a spatiotemporal attention detection framework for detecting attended regions in video sequences. The saliency maps are computed separately based on the temporal and spatial information of the videos. In the temporal model, interest-point correspondences and geometric transformations between images are used to compute the motion contrast. In the spatial attention model, we have presented a fast algorithm for computing the pixel-level saliency map using the color histograms of the video images, and a hierarchical attention representation is established: rectangular attended regions are initialized based on the attended points and are iteratively expanded by analyzing the expansion potentials along their sides. To obtain the spatiotemporal attention model, a dynamic fusion technique is applied to combine the temporal and spatial models, with the dynamic weights of the two individual models controlled by the pseudo-variance of the temporal saliency values. Extensive testing has been performed on numerous video sequences to demonstrate the effectiveness of the proposed framework, and very satisfactory results have been obtained.

In our spatial attention model, we have incorporated color as the contrast information. Based on the experimental results, this cue works well. However, homogeneity is often defined not only by color but also by other cues. In the next stage of our research, we plan to incorporate texture information into the spatial attention model: similar to our current color histogram based approach, distributions of image texture features will be constructed and used in the texture contrast computation. Visual attention is widely applied in various fields, as described in Section 1.1. One of these is object/activity detection and classification, which is often considered a top-down approach to attention detection, since top-down approaches utilize prior knowledge of the targets. Another part of our plan is to combine our bottom-up approach with top-down techniques to achieve both higher efficiency and higher accuracy in object/activity classification tasks.


Figure 13: Comparison between the optical-flow based saliency computation and our proposed spatiotemporal attention detection method. For each sequence, two consecutive images are given in (a) and (b). Figure (c) shows the dense optical flow field using SSD block matching. Figure (d) shows the saliency map computed based on the contrast between the optical flow vectors. Figure (e) shows the saliency map generated by our proposed method. In the results, the performance of the optical-flow based method is greatly degraded by the noise around the moving objects.

7. ACKNOWLEDGEMENTS
This material is based upon the work funded in part by the U.S. Government. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the U.S. Government.

8. REFERENCES
[1] J.C. Baccon, L. Hafemeister and P. Gaussier, "A Context and Task Dependent Visual Attention System to Control a Mobile Robot", ICIRS, 2002.
[2] O. Boiman and M. Irani, "Detecting Irregularities in Images and in Video", ICCV, 2005.
[3] L-Q. Chen, X. Xie, W-Y. Ma, H.J. Zhang and H-Q. Zhou, "Image Adaptation Based on Attention Model for Small-Form-Factor Devices", ICMM, 2003.
[4] W-H. Cheng, W-T. Chu, J-H. Kuo and J-L. Wu, "Automatic Video Region-of-Interest Determination Based on User Attention Model", ISCS, 2005.
[5] J.A. Driscoll, R.A. Peters II and K.R. Cave, "A Visual Attention Network for a Humanoid Robot", ICIRS, 1998.
[6] J. Duncan and G.W. Humphreys, "Visual Search and Stimulus Similarity", Psychological Review, 1989.
[7] L.L. Galdino and D.L. Borges, "A Visual Attention Model for Tracking Regions Based on Color Correlograms", Brazilian Symposium on Computer Graphics and Image Processing, 2000.
[8] J. Han, K.N. Ngan, M. Li and H.J. Zhang, "Towards Unsupervised Attention Object Extraction by Integrating Visual Attention and Object Growing", ICIP, 2004.
[9] Y. Hu, D. Rajan and L-T. Chia, "Adaptive Local Context Suppression of Multiple Cues for Salient Visual Attention Detection", ICME, 2005.
[10] L. Itti, C. Koch and E. Niebur, "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis", PAMI, 1998.
[11] T. Kohonen, "A Computational Model of Visual Attention", IJCNN, 2003.
[12] O. Le Meur, D. Thoreau, P. Le Callet and D. Barba, "A Spatio-Temporal Model of the Selective Human Visual Attention", ICIP, 2005.
[13] D.G. Lowe, "Distinctive Image Features From Scale-Invariant Keypoints", IJCV, 2004.
[14] Z. Lu, W. Lin, X. Yang, E.P. Ong and S. Yao, "Modeling Visual Attention's Modulatory Aftereffects on Visual Sensitivity and Quality Evaluation", T-IP, 2005.
[15] Y.F. Ma and H.J. Zhang, "Contrast-Based Image Attention Analysis by Using Fuzzy Growing", ACM Multimedia, 2003.
[16] R. Milanese, H. Wechsler, S. Gill, J.-M. Bost and T. Pun, "Integration of Bottom-Up and Top-Down Cues for Visual Attention Using Non-Linear Relaxation", CVPR, 1994.
[17] A. Nguyen, V. Chandrun and S. Sridharan, "Visual Attention Based ROI Maps From Gaze Tracking Data", ICIP, 2004.
[18] A. Oliva, A. Torralba, M.S. Castelhano and J.M. Henderson, "Top-Down Control of Visual Attention in Object Detection", ICIP, 2003.
[19] N. Ouerhani and H. Hugli, "Computing Visual Attention from Scene Depth", ICPR, 2000.
[20] N. Ouerhani, J. Bracamonte, H. Hugli, M. Ansorge and F. Pellandini, "Adaptive Color Image Compression Based on Visual Attention", ICIAP, 2001.
[21] O. Oyekoya and F. Stentiford, "Exploring Human Eye Behaviour Using a Model of Visual Attention", ICPR, 2004.
[22] C. Peters and C. O'Sullivan, "Bottom-Up Visual Attention for Virtual Human Animation", CASA, 2003.
[23] K. Rapantzikos, N. Tsapatsoulis and Y. Avrithis, "Spatiotemporal Visual Attention Architecture for Video Analysis", MMSP, 2004.
[24] U. Rutishauser, D. Walther, C. Koch and P. Perona, "Is Bottom-Up Attention Useful for Object Recognition?", CVPR, 2004.
[25] F. Stentiford, "A Visual Attention Estimator Applied to Image Subject Enhancement and Colour and Grey Level Compression", ICPR, 2004.
[26] TRECVID BBC Rushes Data, TREC Video Retrieval Evaluation Forum, NIST, 2005.
[27] A. Treisman and G. Gelade, "A Feature-Integration Theory of Attention", Cognitive Psychology, 1980.
[28] X.J. Wang, W.Y. Ma and X. Li, "Data-Driven Approach for Bridging the Cognitive Gap in Image Retrieval", ICME, 2004.
