
Copyright © 2010 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail [email protected]. ETRA 2010, Austin, TX, March 22 – 24, 2010. © 2010 ACM 978-1-60558-994-7/10/0003 $10.00

Eye Movement as an Interaction Mechanism for Relevance Feedback in a Content-Based Image Retrieval System

Yun Zhang*1,2, Hong Fu†2, Zhen Liang‡2, Zheru Chi§2, Dagan Feng¶2,3

1School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi, China
2Centre for Multimedia Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China
3School of Information Technologies, The University of Sydney, Sydney, Australia

Abstract

Relevance feedback (RF) mechanisms are widely adopted in Content-Based Image Retrieval (CBIR) systems to improve image retrieval performance. However, there exist some intrinsic problems: (1) the semantic gap between high-level concepts and low-level features and (2) the subjectivity of human perception of visual contents. The primary focus of this paper is to evaluate the possibility of inferring the relevance of images from eye movement data. In total, 882 images from 101 categories are viewed by 10 subjects to test the usefulness of implicit RF, where the relevance of each image is known beforehand. A set of fixation-based measures is thoroughly evaluated, including fixation duration, fixation count, and the number of revisits. Finally, the paper proposes a decision tree to predict the user's input during image searching tasks. The prediction precision of the decision tree is over 87%, which sheds light on a promising integration of natural eye movement into CBIR systems in the future.

CR Categories: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Relevance feedback, Search Process; H.5.2 [Information Interfaces and Representation]: User Interfaces

Keywords: Eye Tracking, Relevance Feedback (RF), Content-Based Image Retrieval (CBIR), Visual Perception

1 Introduction

Numerous digital images are being produced every day by digital cameras, medical devices, security monitors, and other image-capturing apparatus. Because of the exponential increase in the number of images, it has become more and more difficult to retrieve a desired picture even from a photo album on a home computer. Traditional image retrieval methods based on metadata, such as textual annotations or user-specified tags, remain the industry standard for retrieval from large image collections. However, manual image annotation is time-consuming, laborious, and expensive. Moreover, the subjective nature of human annotation adds another dimension of difficulty to managing image databases.

CBIR is an alternative solution for retrieving images. However, after years of rapid growth since the 1990s [Flickner et al. 1995], the gap between low-level features and the semantic content of images has held back progress, and the field has entered a plateau phase. This gap can be concretely outlined in three aspects: (1) image representation, (2) similarity measure, and (3) user interaction. Most image representations are based on researchers' intuition and mathematical convenience rather than on human eye behavior. Do the extracted features reflect humans' understanding of an image's content? There is no clear answer to this question. Similarity measures are highly dependent on the features and structures used in image representation, and developing better distance descriptors and refining similarity measures are also very challenging. User interaction can be a feasible approach to answering the question and improving image retrieval performance. In the Relevance Feedback (RF) process, the user is asked to refine the search by providing explicit RF, such as selecting Areas-of-Interest (AOIs) in the query image or ticking positive and negative samples among the retrieved results. In the past few years, many articles have reported that RF can help establish the association between low-level features and the semantics of images and improve retrieval performance [Liu et al. 2006; Tao et al. 2008].

However, explicit feedback is laborious for the user and limited in complexity. In this paper, we propose eye-movement-based implicit feedback as a rich and natural source to replace time-consuming and expensive explicit feedback. As far as we know, there are only a few preliminary studies on applying general eye movement features to image retrieval. One is Oyekoya and Stentiford's work [Oyekoya and Stentiford 2004; Oyekoya and Stentiford 2006]: they investigated fixation durations and found that they differ between images with and without a clear AOI. The other was reported by Klami et al. [2008], who proposed a nine-feature vector derived from different forms of fixations and saccades and used a classifier to predict the one relevant image among four candidates.

Unlike the previous work, the study reported in this paper attempts to simulate a more realistic and complex image retrieval situation and to quantitatively analyze the correlation between users' eye behavior and target images (positive images). In our experiments, the images come from a wide variety of web sources, and in each task the query image and the number of positive images vary from time to time. We evaluated the significance of fixation durations, fixation counts, and the number of revisits to provide a systematic interpretation of the user's attention and effort allocation in eye movements, laying a concrete and substantial foundation for involving natural eye movement as a robust RF source [Zhou and Huang 2003].

*email: [email protected]  †email: [email protected]  ‡email: [email protected]  ‖email: [email protected]  §email: [email protected]



The rest of the paper is organized as follows. Section 2 introduces the experimental design and settings for the relevance feedback tasks and the corresponding eye movement data collection. In Section 3, we report a thorough investigation of using fixation duration, fixation count, and the number of revisits to predict relevant images; ANOVA tests are performed on these factors to reveal their significance and interconnections. Section 4 proposes a decision tree model to predict the user's input during the image searching tasks. Finally, we conclude with our results and propose future work.

2 Design of Experiments

2.1 Task Setup

We study an image searching task that reflects the kinds of activities occurring in a complete CBIR system. In total, 882 images were randomly selected from 101 object categories; the image set was obtained by collecting images through the Google image search engine [Li 2005]. The design and an example of the searching task interface are shown in Fig. 1. At the top left is the query image. Twenty candidate images are arranged in a 4×5 grid. All of the images are drawn from the 101 categories, which include landscapes, animals, buildings, human faces, and home appliances. The red blocks in Fig. 1(a) denote the locations of the positive images in Fig. 1(b) (Class No. 22: Pyramid). The others are negative images, and their image classes differ from one another. That is to say, apart from the query image's category, no two images in the grid are from the same category. The candidate images in each searching stimulus are randomly arranged.

Figure 1. Image searching stimulus. (a) The layout of a searching stimulus with 5 positive images; (b) an example.

Such a simulated relevance feedback task asks each participant to use their eyes to locate the positive images in each stimulus. On locating a positive image, the participant selects the target by fixating on it for a short period of time. A task set is composed of 21 such stimuli, whose numbers of positive images vary from 0 to 20. Thus, a task set contains 21 × 21 = 441 images, and the total numbers of negative and positive candidate images are equal (210 each).
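A task set with these properties could be generated along the following lines. This is an illustrative sketch, not the authors' code: the names `all_categories` and `images_by_category` (a mapping from category to a list of at least 20 image identifiers) are our own assumptions.

```python
import random

N_CANDIDATES = 20   # 4 x 5 grid of candidate images
N_STIMULI = 21      # positive counts 0..20, one stimulus each

def build_task_set(query_category, all_categories, images_by_category):
    """Build 21 stimuli; the stimulus with n_pos positives draws them from
    the query category and each negative from a distinct other category."""
    stimuli = []
    for n_pos in range(N_STIMULI):
        positives = random.sample(images_by_category[query_category], n_pos)
        # Each negative comes from a different non-query category.
        neg_cats = random.sample(
            [c for c in all_categories if c != query_category],
            N_CANDIDATES - n_pos)
        negatives = [random.choice(images_by_category[c]) for c in neg_cats]
        grid = positives + negatives
        random.shuffle(grid)  # random arrangement within the stimulus
        stimuli.append(grid)
    return stimuli
```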

2.2 Apparatus and Procedure

Eye tracking data are collected with a Tobii X120 eye tracker, whose accuracy is α = 0.5° with drift β = 0.3°. Each candidate image has a resolution of 300 × 300 pixels, so an image stimulus occupies 1800 × 1200 pixels. Each stimulus is displayed on a screen viewed from a distance of D = 600 mm; the screen's resolution is 1920 × 1280 pixels and its pixel pitch is h = 0.264 mm. Hence the output uncertainty is R = D tan(α + β)/h ≈ 30 pixels, which ensures that the error of the gaze data is no larger than 1% of the area of each candidate image.
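As a worked check of this arithmetic (a minimal sketch; the variable names are ours, the values are the paper's):

```python
import math

alpha = 0.5    # tracker accuracy, degrees
beta = 0.3     # tracker drift, degrees
D = 600.0      # viewing distance, mm
h = 0.264      # pixel pitch, mm per pixel

# Worst-case angular error (alpha + beta) projected onto the screen plane,
# converted from millimeters to pixels via the pixel pitch.
R = D * math.tan(math.radians(alpha + beta)) / h
print(f"gaze uncertainty: {R:.1f} pixels")  # ~31.7, i.e. about 30 pixels
```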

Ten participants took part in the study: four females and six males, aged from 20 to 32, all with an academic background. All of them are proficient computer users, and half of them had experience with an eye tracking system. Their vision was either normal or corrected-to-normal. The participants were asked to complete two sets of the above-mentioned image searching tasks, and the gaze data were recorded at a 60 Hz sampling rate. Afterwards, the participants were asked to indicate which images they had chosen as positive images, to ensure the accuracy of the subsequent analysis of their eye movement data. The eye tracker is non-intrusive and allows a 300 × 220 × 300 mm free head movement space. Candidate images and the locations of positive images are varied within and between the task sets; in other words, no two images are the same and no two stimuli have the same positive image locations. This reduces memory effects and simulates a natural relevance feedback situation.

3 Analysis of Gaze Data in Image Searching

Raw gaze data are preprocessed by finding fixations with the built-in filter provided by Tobii Technology. The filter maps a series of raw coordinates to a single fixation if the coordinates stay sufficiently long within a sphere of a given radius. We used an interval threshold of 150 ms and a radius of 1° of visual angle.
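Tobii's filter is proprietary, but the behavior described above can be sketched as a simple distance-threshold filter over the raw gaze stream. This is a hypothetical reimplementation under our own data layout (`samples` as time-ordered (t_ms, x_px, y_px) tuples), not Tobii's code:

```python
import math

def find_fixations(samples, min_dur_ms=150.0, radius_deg=1.0,
                   dist_mm=600.0, pixel_pitch_mm=0.264):
    """Group consecutive gaze samples that stay within a circle whose radius
    is 1 degree of visual angle; keep groups lasting at least 150 ms.

    Returns a list of (start_ms, duration_ms, cx, cy) fixations.
    """
    radius_px = dist_mm * math.tan(math.radians(radius_deg)) / pixel_pitch_mm
    fixations, group = [], []

    def flush(group):
        duration = group[-1][0] - group[0][0]
        if duration >= min_dur_ms:
            cx = sum(p[1] for p in group) / len(group)
            cy = sum(p[2] for p in group) / len(group)
            fixations.append((group[0][0], duration, cx, cy))

    for t, x, y in samples:
        if group:
            cx = sum(p[1] for p in group) / len(group)
            cy = sum(p[2] for p in group) / len(group)
            if math.hypot(x - cx, y - cy) > radius_px:
                flush(group)   # gaze left the circle: close this group
                group = []
        group.append((t, x, y))
    if group:
        flush(group)           # close the final group
    return fixations
```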

3.1 Fixation Duration and Fixation Count

The main features used in eye-tracking-related information retrieval are fixations and saccades [Jacob and Karn 2003]. Two groups of metrics derived from fixations, fixation duration and fixation count, are thoroughly studied to support the possibility of inferring the relevance of images from eye movements [Goldberg et al. 2002; Gołofit 2008]. Suppose that FDP(m) and FDN(m) are the fixation durations on the positive and negative images observed by subject m, respectively, and FCP(m) and FCN(m) are the fixation counts on the positive and negative images observed by subject m, respectively. Then in our searching task, FDP(m) and FDN(m) are defined as

FDP(m) = [ Σ_{i,j,k} FD_{i,j,k}(m) · sgn(P_{i,j,k}(m)) ] / [ Σ_{i,j,k} sgn(P_{i,j,k}(m)) ]

FDN(m) = [ Σ_{i,j,k} FD_{i,j,k}(m) · (1 − sgn(P_{i,j,k}(m))) ] / [ Σ_{i,j,k} (1 − sgn(P_{i,j,k}(m))) ]        (1)

where i = 0, 1, …, 20 indexes the image candidates in each searching stimulus interface; j = 1, 2, …, 21 indexes the stimuli in each searching task (it also corresponds to the number of positive images in the current stimulus); k = 1, 2 denotes the task set; m = 1, 2, …, 10 indexes the subjects; and sgn(x) is the signum function. Consequently, FD_{i,j,k}(m) is the fixation duration on the i-th image candidate of the j-th stimulus of the k-th task set from subject m, and

P_{i,j,k}(m) = 1 if subject m regards candidate image i as positive; 0 if subject m regards candidate image i as negative.

In a similar manner, FCP(m) and FCN(m) are defined as

FCP(m) = [ Σ_{i,j,k} FC_{i,j,k}(m) · sgn(P_{i,j,k}(m)) ] / [ Σ_{i,j,k} sgn(P_{i,j,k}(m)) ]

FCN(m) = [ Σ_{i,j,k} FC_{i,j,k}(m) · (1 − sgn(P_{i,j,k}(m))) ] / [ Σ_{i,j,k} (1 − sgn(P_{i,j,k}(m))) ]        (2)

where FC_{i,j,k}(m) is the fixation count on the i-th image candidate of the j-th stimulus of the k-th task set from subject m. The two pairs of fixation-related variables were monitored and recorded during the experiment.
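In code, these four aggregates are indicator-weighted means over all candidate images. The sketch below assumes our own flat data layout, not the paper's: `records` is a list of (FD, FC, is_positive) triples for one subject, where is_positive plays the role of P_{i,j,k}(m).

```python
def fixation_stats(records):
    """Compute FDP, FDN, FCP, FCN for one subject, following Eqs. (1)-(2).

    records: list of (fd, fc, is_positive) triples, one per candidate image
    over all stimuli and task sets.
    """
    pos = [(fd, fc) for fd, fc, p in records if p]
    neg = [(fd, fc) for fd, fc, p in records if not p]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    FDP = mean([fd for fd, _ in pos])   # mean duration on positives
    FDN = mean([fd for fd, _ in neg])   # mean duration on negatives
    FCP = mean([fc for _, fc in pos])   # mean count on positives
    FCN = mean([fc for _, fc in neg])   # mean count on negatives
    return FDP, FDN, FCP, FCN
```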


The average values and standard deviations over the ten participants are summarized in Table 1.

Table 1. Statistics on the fixation duration (seconds) and fixation count on positive and negative images.

Sub.   FDP(m)        FDN(m)        FCP(m)    FCN(m)
1      1.410±1.081   0.415±0.481   2.5±1.9   1.3±1.3
2      1.332±0.394   0.283±0.247   2.7±1.4   1.2±0.9
3      2.582±1.277   0.418±0.430   5.6±3.3   1.7±1.5
4      0.805±0.414   0.356±0.328   2.4±1.2   1.5±1.2
5      1.154±0.484   0.388±0.284   2.6±1.4   1.5±1.0
6      1.880±0.926   0.402±0.338   3.0±1.9   1.4±1.0
7      0.987±0.397   0.166±0.283   1.7±0.8   0.6±0.7
8      0.704±0.377   0.358±0.254   2.2±1.1   1.3±0.9
9      1.125±0.674   0.329±0.403   3.0±2.0   1.4±1.5
10     1.101±0.444   0.392±0.235   2.7±1.3   1.5±0.8
AVG.   1.308±0.891   0.351±0.345   2.8±2.0   1.3±1.1

Analysis of variance (ANOVA) tests are performed to determine whether visual behavior differs between the observation of positive and negative images. Given individual differences in eye movements, we designed two groups of two-way ANOVA over three factors: test subject, fixation duration, and fixation count. The results are shown in Table 2.

Table 2. ANOVA test results among three factors: test subject, fixation duration, and fixation count.

GROUP I
Factor                  Levels                    Test result
(A) Test subjects       10 levels (10 subjects)   F(9,9) = 1.26, p < 0.37
(B) Fixation duration   2 levels (FDP & FDN)      F(1,9) = 32.84, p < 0.0003

GROUP II
Factor                  Levels                    Test result
(A) Test subjects       10 levels (10 subjects)   F(9,9) = 2.03, p < 0.15
(B) Fixation count      2 levels (FCP & FCN)      F(1,9) = 28.28, p < 0.0005

As illustrated in Table 2, both fixation duration and fixation count reveal significant effects between positive and negative images during the simulated relevance feedback tasks. Concretely, the fixation durations on each positive image across all subjects (1.30 seconds on average) are longer than those on negative images (0.35 seconds). Correspondingly, the analysis of fixation count produces a similar result: subjects visit a positive image more times (2.8) than a negative one (1.3). On the other hand, the variation across subjects has no significant effect in either group (in GROUP I, p = 0.37 > α = 0.05; in GROUP II, p = 0.15 > α = 0.05).
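The GROUP I test can be reproduced with standard tools. The sketch below uses statsmodels on the per-subject means from Table 1; the paper does not publish its analysis scripts, so the variable names and the use of statsmodels are our assumptions.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Per-subject mean fixation durations (seconds), taken from Table 1.
fdp = [1.410, 1.332, 2.582, 0.805, 1.154, 1.880, 0.987, 0.704, 1.125, 1.101]
fdn = [0.415, 0.283, 0.418, 0.356, 0.388, 0.402, 0.166, 0.358, 0.329, 0.392]

rows = [{"subject": s, "image_type": t, "duration": d}
        for t, ds in (("positive", fdp), ("negative", fdn))
        for s, d in enumerate(ds, start=1)]
df = pd.DataFrame(rows)

# Two-way ANOVA without replication: one observation per subject x type
# cell, so the interaction serves as the error term. The degrees of
# freedom match Table 2: F(9,9) for subjects, F(1,9) for image type.
model = ols("duration ~ C(subject) + C(image_type)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```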

3.2 Number of Revisits

A revisit is defined as a re-fixation on an AOI that was previously fixated. Much human-computer interaction and usability research shows that re-fixating or revisiting a target may indicate special interest in that target. Therefore, analyzing revisits during the relevance feedback process may reveal the correlation between eye movement patterns and positive image candidates.

Figure 2 shows the overall visit frequency (no. of revisits = no. of visits − 1) throughout the whole image searching task. We can see that (1) some of the candidate images are never visited, which indicates the use of pre-attentive vision at the very beginning of the visual search [Salojärvi et al. 2004]; during the pre-attentive process, all the candidate images are scanned to decide the subsequent fixation locations; and (2) in our experiments, revisits happen on both positive and negative images; the majority of images are visited just once, while some are revisited during the image searching.

Figure 2. The total visit histogram. The X-axis denotes the number of visits (no visit, 1, 2, …, >6 times) and the Y-axis the corresponding count. (Recovered values: no visit 403; 1 visit 2149; 2 visits 878; 3 visits 306; 4 visits 119; 5 visits 65; >6 visits 80.)

Table 3. Overall revisits on positive and negative images.

Revisits per image candidate:          1     2     3    4    5    6    >7
Revisit counts on positive images:     549   196   88   55   34   13   27
Revisit counts on negative images:     329   110   31   10   3    2    1
Total number of revisits:              878   306   119  65   37   15   28
Share of revisits on positive images:  63%   64%   74%  85%  92%  87%  100%

To compare with Oyekoya and Stentiford's work [2006], we investigate whether the revisit count has a different effect on positive versus negative image candidates over all participants (Table 3). When the revisit count is ≥ 3, the result of a one-way ANOVA is significant, with F(1,8) = 5.73, p < 0.044. That is to say, the probability that a revisited image is positive increases with the revisit count; for example, when an image is revisited three or more times, it has a very high probability (over 74%) of being a positive candidate. As a result, the number of revisits is also a feasible implicit relevance feedback signal for driving an image retrieval engine.

4 Feature Extraction and Results

The primary focus of this paper is to evaluate the possibility of inferring the relevance of images from eye movement data. Features such as fixation duration, fixation count, and the number of revisits have shown discriminating power between positive and negative images. Consequently, we composed a simple set of 11 features (f_1, f_2, …, f_11), an eye movement vector, to predict the positive images in each returned 4×5 image candidate set in the simulated relevance feedback task, where j = 1, 2, …, 20 denotes the number of positive images in the current stimulus and m = 1, 2, …, 10 represents the subject. The features f_1, f_2, …, f_11 are derived from the per-image measures listed in Table 4, where i = 1, …, 20 and FL_i = FD_i / FC_i.

Table 4 Features used in relevance feedback to predict positive images


Feature               Description
FD_i                  Fixation duration on the i-th image in the 4×5 candidate set interface
FC_i                  Fixation count on the i-th image in the 4×5 candidate set interface
FL_i = FD_i / FC_i    Fixation length (average duration per fixation) on the i-th image in the 4×5 candidate set interface
R_i                   Number of revisits on the i-th image in the 4×5 candidate set interface

Unlike Klami et al.'s work [Klami et al. 2008], we use a decision tree (DT) as a classifier to automatically learn the prediction rules. The data set described in Section 2 is divided into training and testing sets to evaluate the prediction accuracy. Two different methods are used to train the DT, as detailed in Table 5 (prediction precisions of 87.3% and 93.5%, respectively), and an example of the predicted positive images in a 4×5 candidate set is shown in Figure 3.

Table 5. Training methods and testing results of the decision trees.

Method I
  Training data set:     1, 2, …, 5
  Testing data set:      5, 6, …, 10
  Prediction precision:  87.3%

Method II
  Training data set:     1, 3, 5, …, 19
  Testing data set:      2, 4, 6, …, 20
  Prediction precision:  93.5%
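A decision-tree classifier of this kind can be sketched with scikit-learn. This is our illustration, not the authors' implementation: the random arrays stand in for the recorded eye movement vectors, with one 11-dimensional feature row per candidate image (see Table 4) and a 0/1 relevance label.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score

rng = np.random.default_rng(0)

# Placeholder data: replace with the real per-image feature vectors
# (f_1, ..., f_11) and positive/negative labels from the experiment.
X = rng.random((800, 11))
y = rng.integers(0, 2, 800)

# Method I-style split: train on one part of the data, test on the rest.
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("prediction precision:", precision_score(y_test, clf.predict(X_test)))
```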

Figure 3. An example of predicted positive images in a 4×5 candidate set in the simulated relevance feedback task. The query image is "hedgehog", and the DT model returned 8 predicted positive images (in red frames) based on the 11-feature vector, with 100% accuracy.

5 Conclusion and Further Work

An eye tracking system can potentially be integrated into a CBIR system as a more efficient input mechanism for implementing the user's relevance feedback process. In this paper, we mainly concentrated on a group of fixation-related measurements that capture static eye movement patterns. In fact, dynamic characteristics such as saccades and scan paths can also manifest human organizational behavior and decision processes, revealing the pre-attention and cognition of a human being while viewing an image. In our further work, we will develop a more comprehensive study that includes both the static and dynamic features of eye movements. Eye movement is fundamentally a unity of conscious and unconscious visual cognitive behavior, which can be used not only for relevance feedback but also as a new source for image representation: human image viewing automatically bridges low-level features, such as color, texture, shape, and spatial information, to human attention, such as AOIs. As a result, eye tracking data can be a rich new source for improving image representation [Wu et al. 2009]. Our future work is to develop an eye-tracking-based CBIR system in which natural human eye movements are effectively exploited in the modules of image representation, similarity measurement, and relevance feedback.

Acknowledgments

The work reported in this paper is substantially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project code: PolyU 5141/07E) and the PolyU Grant (Project code: 1-BBZ9).

References

DACHENG TAO, XIAOOU TANG, AND XUELONG LI. 2008. Which components are important for interactive image searching? IEEE Transactions on Circuits and Systems for Video Technology 18, 3–11.

FLICKNER, M., SAWHNEY, H., NIBLACK, W., ASHLEY, J., HUANG, Q., DOM, B., GORKANI, M., HAFNER, J., LEE, D., PETKOVIC, D., STEELE, D., AND YANKER, P. 1995. Query by image and video content: The QBIC system. Computer 28, 23–32.

GOLDBERG, J.H., STIMSON, M.J., LEWENSTEIN, M., SCOTT, N., AND WICHANSKY, A.M. 2002. Eye tracking in web search tasks: design implications. In ETRA '02: Proceedings of the 2002 Symposium on Eye Tracking Research & Applications, New Orleans, Louisiana, ACM, New York, NY, USA, 51–58.

GOŁOFIT, K. 2008. Click passwords under investigation. In Computer Security – ESORICS 2007, 343–358.

JACOB, R. AND KARN, K. 2003. Eye tracking in human-computer interaction and usability research: Ready to deliver the promises. In The Mind's Eye: Cognitive and Applied Aspects of Eye Movement Research, HYONA, RADACH, AND DEUBEL, Eds. Elsevier Science, Oxford, England.

KLAMI, A., SAUNDERS, C., DE CAMPOS, T.E., AND KASKI, S. 2008. Can relevance of images be inferred from eye movements? In MIR '08: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, Vancouver, British Columbia, Canada, ACM, New York, NY, USA, 134–140.

LEI WU, YANG HU, MINGJING LI, NENGHAI YU, AND XIAN-SHENG HUA. 2009. Scale-invariant visual language modeling for object categorization. IEEE Transactions on Multimedia 11, 286–294.

LI, F. 2005. Visual Recognition: Computational Models and Human Psychophysics. PhD thesis, California Institute of Technology.

LIU, D., HUA, K., VU, K., AND YU, N. 2006. Fast query point movement techniques with relevance feedback for content-based image retrieval. In Advances in Database Technology – EDBT 2006, 700–717.

OYEKOYA, O. AND STENTIFORD, F. 2004. Exploring human eye behaviour using a model of visual attention. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), Volume 4, IEEE Computer Society, Washington, DC, USA, 945–948.

OYEKOYA, O. AND STENTIFORD, F. 2006. Perceptual image retrieval using eye movements. In Advances in Machine Vision, Image Processing, and Pattern Analysis, 281–289.

SALOJÄRVI, J., PUOLAMÄKI, K., AND KASKI, S. 2004. Relevance feedback from eye movements for proactive information retrieval. In Workshop on Processing Sensory Information for Proactive Systems (PSIPS 2004), 14–15.

ZHOU, X.S. AND HUANG, T.S. 2003. Relevance feedback in image retrieval: A comprehensive review. Multimedia Systems 8, 536–544.
