

Inferring Object Relevance from Gaze in Dynamic Scenes

Melih Kandemir∗

Helsinki University of Technology
Department of Information and Computer Science

Veli-Matti Saarinen†

Helsinki University of Technology
Low Temperature Laboratory

Samuel Kaski‡

Helsinki University of Technology
Department of Information and Computer Science

Abstract

As prototypes of data glasses having both data augmentation and gaze tracking capabilities are becoming available, it is now possible to develop proactive gaze-controlled user interfaces to display information about objects, people, and other entities in real-world setups. In order to decide which objects the augmented information should be about, and how saliently to augment, the system needs an estimate of the importance or relevance of the objects of the scene for the user at a given time. The estimates will be used to minimize distraction of the user, and for providing efficient spatial management of the augmented items. This work is a feasibility study on inferring the relevance of objects in dynamic scenes from gaze. We collected gaze data from subjects watching a video for a pre-defined task. The results show that a simple ordinal logistic regression model gives relevance rankings of scene objects with a promising accuracy.

CR Categories: H.5.2 [Information Interfaces and Presentation (HCI)]: User interfaces—User interface management systems

Keywords: augmented reality, gaze tracking, information retrieval, intelligent user interfaces, machine learning, ordinal logistic regression

1 Introduction

In this paper, we develop a method needed for doing information retrieval in dynamic real-world scenes where the queries are formulated implicitly by gaze. In our setup the user wears a ubiquitous information access device, “data glasses” having eye-tracking and information augmentation capabilities. The device is assumed to be capable of recognising and tracking certain types of objects from the first-person video data of the user. Figure 1 illustrates the idea. Some objects, three faces and the whiteboard in this image, are augmented with attached boxes that include textual information obtained from other sources. In such a setup, each visible object in a scene can be considered as a channel through which additional relevant information can be obtained as augmented on the screen. As in traditional information retrieval setups such as text search engines, the potential abundance of available information brings up the need for a mechanism to rank the channels with respect to their relevance. This is particularly important in proactive mobile setups where the augmented items are also potential distractors.

Our goal is to infer the degree of interest of the user for the objects

∗e-mail: [email protected]
†e-mail: [email protected]
‡e-mail: samuel.kaski@tkk.fi

Figure 1: A screenshot from the view of hypothetical data glasses with augmented-reality capability during a short presentation in a meeting room (Scene 1)

in the scene. This problem has a connection to modelling of visual attention [Henderson 2003; Itti et al. 1998; Zhang et al. 2008]; whereas visual attention models typically try to predict the gaze pattern given the scene, our target is the inverse problem of inferring the user’s state (interests) given the scene and the gaze trajectory. A good solution for the former problem would obviously help in our task too, but current visual attention models mainly consider only physical pointwise saliency, which does not yet capture the mainly top-down nature of the effects of the user’s interest on the gaze pattern. Although there exist some initial attempts towards two-way saliency modelling [Torralba et al. 2006], these have been evaluated only for rather trivial visual tasks such as counting objects of a certain type in static images. Unlike top-down models, where the model is optimised given a well-defined search task, the cognitive task of the subject in our setup is hidden and may even be unclear to the subject herself. Hence, we start with data-driven statistical machine learning techniques for the inverse modelling task.

Gaze data has been used in user interfaces in three ways. Our goal is the furthest from the most frequent approach, off-line analysis, for instance studying the effectiveness of advertisements in attracting people’s attention, or analysis of social interaction. In the second approach the user selects actions by explicitly looking at the choices, for instance eye typing [Hyrskykari et al. 2000; Ward and MacKay 2002]. Although such explicit selection mechanisms are easy to implement, they require full user attention and are strenuous because of the Midas touch effect: each glance activates an action whether it is intended or not. The third way of using gaze data in user interfaces is implicit feedback. The user uses her gaze normally, and information needed by the interface is inferred from the gaze data. An emerging example is proactive information retrieval, where statistical machine learning methods are used for inferring relevance from gaze patterns. The inferred relevance judgements are then used as implicit relevance feedback for information retrieval. This has been done for text retrieval by generating implicit queries from gaze patterns [Hardoon et al. 2007].

The same principle has been used for image retrieval as well [Klami et al. 2008], recently also coupled dynamically to a retrieval engine in an interactive zooming interface [Kozma et al. 2009]. Gaze has additionally been used as a means of proactive interaction, but not information retrieval, in a desktop application by assigning a relevance function to the entities on a synthetic 2D map [Qvarfordt and Zhai 2005].

To test the feasibility of the idea of relevance ranking from gaze in dynamic real-world setups, we prepared a stimulus video and collected gaze data from subjects watching that video. The subjects were then asked to give the true relevance rankings for several frames. We trained an ordinal logistic regression model and measured its accuracy in the relevance prediction task on left-out data.

2 Measurement Setup

We shot a video from the first-person view of a subject visiting three indoor scenes. Then we postprocessed this video by augmenting some of the objects with additional textual information in an attached box. This video was shown to 4 subjects and gaze data was collected. Right after the viewing session the subjects ranked the scene objects in relevance order for a subset of the video frames. The ranking was considered as the ground truth for learning the models and evaluating them. The modelling task is to predict the user-given ranking for an object given the gaze-tracking data from a window immediately preceding the ranked frame.

3 Model for Inferring Relevance

Let us index the stimulus slices preceding each relevance judgement from 1 to $N$. We extract a feature vector (details in the Experiments section) for each scene object $i$ at time slice $t$ to obtain a single unlabelled data point $f_i^{(t)} = \{f_{i1}^{(t)}, f_{i2}^{(t)}, \cdots, f_{id}^{(t)}\}$, where $d$ is the number of features. If we also attach the ground truth relevance ranking $r_i^{(t)}$, we get a labelled data point $(f_i^{(t)}, r_i^{(t)})$. Let us denote the set of data points, one for each object, related to time slice $t$ as a data subset $\Lambda^{(t)} = \{(f_1^{(t)}, r_1^{(t)}), \cdots, (f_{m_t}^{(t)}, r_{m_t}^{(t)})\}$, where $m_t$ is the number of visible objects at time slice $t$. Let us denote the data subset without labels by $\Lambda'^{(t)}$, and the maximum number of visible objects by $L = \max(\{m_1, \cdots, m_N\})$. For notational convenience, we define the most relevant object to have rank $L$, and the rank decreases as relevance decreases. The whole labelled data set consists of the union of all data subsets, $\Delta = \{\Lambda^{(1)}, \Lambda^{(2)}, \cdots, \Lambda^{(N)}\}$.

We search for a mapping from the feature space to the space of relevances, which is conventionally $[0, 1]$. Such a mapping can directly be achieved using ordinal logistic regression [McCullagh and Nelder 1989] if we assume that the relevance of an object depends only on its features, and is independent of the relevance of the other visible objects. We use the standard approach as described briefly below.

Let us denote the probability of the object rank being $k$ as $P(r_i^{(t)} = k \mid f_i^{(t)}) = \phi_k(f_i^{(t)})$. Then we can define the log odds such that the problem reduces to a batch of $L-1$ binary regression problems, one for each $k = 1, 2, \cdots, L-1$:

$$
M_k = \log\left(\frac{P(r_i^{(t)} \le k \mid f_i^{(t)})}{1 - P(r_i^{(t)} \le k \mid f_i^{(t)})}\right)
    = \log\left(\frac{\phi_1(f_i^{(t)}) + \phi_2(f_i^{(t)}) + \cdots + \phi_k(f_i^{(t)})}{\phi_{k+1}(f_i^{(t)}) + \phi_{k+2}(f_i^{(t)}) + \cdots + \phi_L(f_i^{(t)})}\right)
    = w_0^{(k)} + w f_i^{(t)},
$$

where a linear model is assumed. By taking the exponent of both sides we get the CDF of the rank distribution for object $i$ at time $t$:

$$
P(r_i^{(t)} \le k \mid f_i^{(t)}) = \frac{\exp(w_0^{(k)} + w f_i^{(t)})}{1 + \exp(w_0^{(k)} + w f_i^{(t)})}.
$$

Notice that we adopted the standard approach and used common slope coefficients $w = [w_1, \cdots, w_d]$ for all logit models but different intercepts $w_0^{(k)}$. In the training phase, we calculate the maximum likelihood estimates of the parameters $\theta = \{w_0^{(1)}, \cdots, w_0^{(L-1)}, w_1, \cdots, w_d\}$ of this model using the Newton-Raphson technique. Given an unlabelled data subset $\Lambda'^{(t)}$ at time $t$, the object with relevance rank $k$ is predicted to be the one that has the highest probability for that rank, $\arg\max_i \phi_k(f_i^{(t)})$.
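To make the fitting and prediction steps concrete, the following is a minimal sketch, in Python/NumPy, of the proportional-odds model above fitted by maximum likelihood. It is an illustration under assumptions, not the implementation used in the paper: the data layout (an n x d feature matrix X and integer ranks y in {1, ..., L}), the re-parameterisation of the intercepts, and the use of a general-purpose optimiser in place of Newton-Raphson are all assumptions.

# Minimal sketch of the cumulative-logit (proportional-odds) model: common
# slopes w and per-threshold intercepts w0^(k), fitted by maximum likelihood.
# Hypothetical inputs: X is an (n, d) feature matrix, y holds ranks in {1,...,L}.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

def fit_ordinal_logit(X, y, L):
    """Return (w0, w): ordered intercepts, shape (L-1,), and common slopes, shape (d,)."""
    n, d = X.shape

    def unpack(theta):
        # Enforce non-decreasing intercepts by accumulating positive gaps.
        w0 = np.cumsum(np.concatenate(([theta[0]], np.exp(theta[1:L - 1]))))
        return w0, theta[L - 1:]

    def neg_log_lik(theta):
        w0, w = unpack(theta)
        eta = X @ w                                  # linear predictor, shape (n,)
        cdf = expit(w0[None, :] + eta[:, None])      # P(r <= k), shape (n, L-1)
        cdf = np.hstack([np.zeros((n, 1)), cdf, np.ones((n, 1))])
        probs = np.diff(cdf, axis=1)                 # P(r = k), shape (n, L)
        return -np.sum(np.log(probs[np.arange(n), y - 1] + 1e-12))

    theta0 = np.concatenate([[-1.0], np.zeros(L - 2), np.zeros(d)])
    return unpack(minimize(neg_log_lik, theta0, method="BFGS").x)

def rank_probabilities(w0, w, X):
    """phi_k(f) = P(r = k | f) for every row of X; returned shape (n, L)."""
    n = X.shape[0]
    cdf = expit(w0[None, :] + (X @ w)[:, None])
    cdf = np.hstack([np.zeros((n, 1)), cdf, np.ones((n, 1))])
    return np.diff(cdf, axis=1)

Prediction then follows the rule above: for a time slice with feature matrix X_t, the object assigned rank k would be np.argmax(rank_probabilities(w0, w, X_t)[:, k - 1]).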

4 Experiments

4.1 Stimulus Preparation

We shot a video clip, 4 minutes and 17 seconds long, from the first-person view of a subject using a see-through head-mounted display device. In the scenario of the clip, a visitor coming to our laboratory is informed about our research project. The scenario consists of three consecutive scenes:

1. A short presentation in a meeting room: A researcher introduces the project with a block diagram drawn on the whiteboard (Figure 1) in a meeting room. People present are asking questions. The visitor follows the presentation.

2. A walk in the lab corridor: The visitor walks through the laboratory, takes a look at posters on the wall, and zooms in on some of the name tags on office doors.

3. Demo of data collection devices: The host explains how eye-tracking experiments are conducted. He demonstrates a monitor with eye-tracking capabilities and the head-mounted display device.

Next, we augmented the video by attaching information boxes to objects such as faces, the whiteboard, name tags, posters, and devices related to the project. These were considered to be the objects potentially most interesting to the visitor. Short snippets of textual information relevant to the objects were displayed inside the boxes. At most one information box was attached to any one object at a time. We displayed boxes for all visible objects. There were from 0 to 4 objects in the scene at a time; the average number of scene objects was 2.017, with a standard deviation of 1.36. The frame rate of the postprocessed video was 12 fps.

4.2 Data Collection

We collected gaze data from 4 subjects while they were watching the stimulus video in order to get as much information as they could about the research project. After the viewing session, the subjects were shown 154 screenshots from the video in temporal order, each of which represents a 1.66-second slot (20 frames). The users were asked to select the objects that were relevant to them at that moment, and also to rank the selected subset of objects according to their relevance. We defined relevance as the interest in seeing augmented information about an object in the scene at that particular time. All subjects confirmed, after ranking, that they were able to remember the correct ranks for almost all the frames. The subjects were graduate and postgraduate researchers not working on the project related to the study we present in this paper.


4.3 The Eye Tracker

We collected the gaze data with a Tobii 1750 eye tracker with a 50 Hz sampling rate. The tracker has an infra-red stereo camera on a standard flat-screen monitor. The device performs tracking by detecting the pupil centers and measuring the reflection from the cornea. Successive gaze points located within an area of 30 pixels were considered a single fixation. This corresponds to approximately 0.6 degrees of deflection at a normal viewing distance to a 17-inch monitor with 1280 × 1024 pixel resolution. Test subjects were sitting 60 cm away from the monitor.
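As an illustration, the grouping of raw gaze samples into fixations described above could be sketched as a simple dispersion-based procedure. This is a hypothetical reconstruction, not the tracker's actual algorithm; the 30-pixel threshold and 50 Hz sampling rate come from the text, while the function and variable names are assumptions (a real implementation would typically also enforce a minimum fixation duration).

# Hypothetical sketch: group successive gaze samples that stay within a
# 30-pixel area into a single fixation. Each sample is (x, y) in screen
# pixels; at 50 Hz one sample spans 20 ms.
from typing import List, Tuple

SAMPLE_MS = 1000.0 / 50.0   # 50 Hz tracker: one gaze sample every 20 ms
MAX_DISPERSION = 30.0       # pixels, from the text

def detect_fixations(samples: List[Tuple[float, float]]) -> List[Tuple[float, float, float]]:
    """Group successive gaze samples into fixations; return (x, y, duration_ms) tuples."""
    fixations, current = [], []
    for x, y in samples:
        candidate = current + [(x, y)]
        xs = [p[0] for p in candidate]
        ys = [p[1] for p in candidate]
        # The whole candidate group must stay inside a 30-pixel area.
        if max(xs) - min(xs) <= MAX_DISPERSION and max(ys) - min(ys) <= MAX_DISPERSION:
            current = candidate
        else:
            if current:
                cx = sum(p[0] for p in current) / len(current)
                cy = sum(p[1] for p in current) / len(current)
                fixations.append((cx, cy, len(current) * SAMPLE_MS))
            current = [(x, y)]
    if current:
        cx = sum(p[0] for p in current) / len(current)
        cy = sum(p[1] for p in current) / len(current)
        fixations.append((cx, cy, len(current) * SAMPLE_MS))
    return fixations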

4.4 Feature Extraction

We extracted from the gaze and video data a set of features corresponding to each visible object. This was done at every time slice for which the labelled object ranks were available (i.e., for one frame in every 20 consecutive frames). Each of these features summarises a particular aspect of the temporal context (recent past). We define the context at time t to be a slot from time point t − W to t − 1, where W is a predetermined window size. We used the following 11 features:

1. mean area of the bounding box of the object

2. mean area of the information box attached to the object

3. mean distance between the centers of the object bounding box and the attached information box

4. total duration of fixations inside the bounding box of the object

5. total duration of fixations inside the information box attached to the object

6. mean duration of fixations inside the bounding box of the object

7. mean duration of fixations inside the information box attached to the object

8. mean distance of all fixations to the center of the object bounding box

9. mean distance of all fixations to the center of the information box

10. mean length of saccades that ended up with fixations inside the bounding box of the object

11. mean length of saccades that ended up with fixations inside the information box attached to the object

We marked the bounding boxes of the objects manually frame by frame.
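As a concrete illustration, three of the gaze-based features above (features 4, 6, and 8 for a single object) could be computed along the following lines. The data structures (fixation tuples and an axis-aligned bounding box) and helper names are assumptions, not the authors' code.

# Hypothetical sketch of three of the eleven features for one object over the
# context window [t - W, t - 1]. A fixation is (x, y, duration_ms); a bounding
# box is (x_min, y_min, x_max, y_max) in video coordinates.
import math
from typing import List, Tuple

Fixation = Tuple[float, float, float]      # (x, y, duration_ms)
Box = Tuple[float, float, float, float]    # (x_min, y_min, x_max, y_max)

def inside(fix: Fixation, box: Box) -> bool:
    x, y, _ = fix
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]

def object_features(fixations: List[Fixation], box: Box) -> Tuple[float, float, float]:
    """Features 4, 6 and 8 of the list above for one object bounding box."""
    hits = [f for f in fixations if inside(f, box)]
    total_dur = sum(f[2] for f in hits)                      # feature 4
    mean_dur = total_dur / len(hits) if hits else 0.0        # feature 6
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    dists = [math.hypot(f[0] - cx, f[1] - cy) for f in fixations]
    mean_dist = sum(dists) / len(dists) if dists else 0.0    # feature 8
    return total_dur, mean_dur, mean_dist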

4.5 Evaluation

We evaluated the accuracy of the model with respect to the proportion of times the most relevant object was predicted correctly. We compared the model performance with five baseline methods. The first one is random guessing, in which at each time slice scene objects are ranked uniformly at random. The second one is an attention-based method that assigns a relevance proportional to the total fixation duration on the object and on the augmented content; this estimate of object relevance is referred to as gaze intensity [Qvarfordt and Zhai 2005]. This baseline reveals how much intricate gaze patterns, beyond the mere visual attention measured by gaze intensity, contribute to relevance prediction. In the third baseline model we used the ordinal logistic regression model with the features that are not related to gaze: the first three of the features listed above. This allowed us to investigate the effect of the gaze-based features on prediction accuracy. We defined two more baseline models that depend on Itti et al.'s bottom-up visual attention model [Itti et al. 1998] in order to observe how useful such plain attention modelling is in our problem setup, and to test whether our model provides better accuracy. We computed the Itti-Koch saliency map of the labelled frames. Then we calculated the relevance of an object as the maximum saliency inside its bounding box for one baseline model, and as the average saliency inside the bounding box for the other one.

We trained separate models for the user-specific and user-independent cases. In the user-specific case, we trained and tested the model on the data of the same subject. We split the dataset into training and validation sets by random selection without replacement: we randomly selected 2/3 of the dataset for training and left out the remainder for testing. We repeated this process 50 times and measured the mean prediction accuracy. We computed the accuracy for several window sizes, starting from 50 frames and increasing up to 750 frames in 25-frame steps. Our model outperformed all the baseline methods for all subjects and all window sizes (Figure 2). The significance of the difference was tested for each subject separately using the Wilcoxon signed-rank test with α = 0.05. We made the test between our model and the three best-performing baselines: the logit model without gaze features and the two saliency-based models. We selected the window sizes for our model and the logit model without gaze features with respect to the average prediction accuracy on the training data.
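The user-specific evaluation loop could be sketched as follows, reusing the fitting and prediction helpers sketched in Section 3. The repetition count, split proportions, and accuracy criterion follow the text; the data layout (a list of per-slice (feature matrix, rank vector) pairs, with the most relevant object carrying rank L) and the function names are assumptions.

# Hypothetical sketch of the user-specific evaluation: 50 random 2/3 vs 1/3
# splits of the labelled time slices; accuracy is the proportion of test slices
# in which the most relevant object is predicted correctly.
import numpy as np

def evaluate_user_specific(slices, L, n_repeats=50, train_frac=2 / 3, seed=0):
    """slices: list of (X_t, y_t) pairs; the most relevant object in each slice has rank L."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_repeats):
        idx = rng.permutation(len(slices))
        n_train = int(train_frac * len(slices))
        train = [slices[i] for i in idx[:n_train]]
        test = [slices[i] for i in idx[n_train:]]

        X = np.vstack([X_t for X_t, _ in train])
        y = np.concatenate([y_t for _, y_t in train])
        w0, w = fit_ordinal_logit(X, y, L)             # sketched in Section 3

        correct = 0
        for X_t, y_t in test:
            phi = rank_probabilities(w0, w, X_t)       # (m_t, L) rank probabilities
            predicted = int(np.argmax(phi[:, L - 1]))  # object most likely to have rank L
            correct += int(y_t[predicted] == L)
        accuracies.append(correct / len(test))
    return float(np.mean(accuracies))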

Figure 2: User-specific model accuracy for one user. Sub-images show the accuracy (proportion of correct predictions) as a function of the context window size (in frames, x-axis). Red diamonds: our proposed model; blue circles: baseline model using only the video features (not gaze); green reversed triangles: attention-only model; cyan squares: random guessing; black triangles: maximum saliency inside object; pink crosses: average saliency inside object.

In the user-independent case, we left out one user and trained the model with the whole datasets of the other users. Then we evaluated the accuracy on the data of the left-out user. This procedure was repeated for all users. The results gave the same conclusions as in the user-specific case, although with some decrease in accuracy for all the metrics, and the outperformance was not significant for some test subjects. This is probably due to the increased uncertainty originating from the subjectivity of top-down cognitive processes; a single common model may be inadequate to handle the variability of gaze patterns across the subjects. This issue needs to be investigated further.

The box plot in Figure 3 (a) shows the learned regressor weights for a subject in the user-specific case. The small variance of the weights indicates that the model is stable across different splits. Both the magnitude and the ordering of the weights in the user-independent case were very similar to those in the user-specific case.

The best accuracy is achieved at rather long window sizes (525 frames in the user-specific case and 300 frames in the user-independent case for test subject 1). This supports the claim that the context does contain information related to object relevances. The decrease in accuracy as the window size increases further is not very large, and in particular the proposed model seems to be insensitive to the window size.

The feature with the strongest positive influence on relevance is the mean distance between the object center and the fixations within the context (w8). Intuitively, the relevance of an object increases as the fixations within the context get closer to the center of that object. The feature with the strongest negative influence is the mean distance between the object and the box: as the information box is placed closer to the object, the object attracts more interest. Some of the weights are harder to interpret and we will study them further in our subsequent research.

Figure 3: Variance of the regressor weights for each of the features among different bootstrap trials in the user-specific model. The features are numbered as in Section 4.4.

5 Discussion

In this work, we assessed the feasibility of a gaze-based object relevance predictor in real-world scenes where the scene objects were augmented with additional information. For this, we applied a rather simple ordinal logistic regression model over a set of gaze-pattern and visual-content features. The prominent increase in accuracy when the gaze-pattern features are added to the feature set reveals that gaze statistics and visual features make a mutually complementary contribution to relevance inference. The optimal way of combining these two sources of information should be studied further. The outperformance of our model over the bottom-up attention model in predicting the most relevant object can be attributed to the fact that bottom-up models are incapable of reflecting the task-dependent control of attention.

A better performance can probably be achieved by enriching the feature set and using a more complex model that better fits the data. Generalisation of the model to other real-world scenes also needs to be investigated further. This can be done by plugging the model into a wearable information access device and assessing its performance during online use. Such an assessment of our model is currently in progress.

6 Acknowledgements

Melih Kandemir and Samuel Kaski belong to the Finnish Center of Excellence in Adaptive Informatics and the Helsinki Institute for Information Technology (HIIT). Samuel Kaski also belongs to the PASCAL2 EU network of excellence. This study is funded by the TKK MIDE project UI-ART.

References

HARDOON, D., SHAWE-TAYLOR, J., AJANKI, A., PUOLAMAKI, K., AND KASKI, S. 2007. Information retrieval by inferring implicit queries from eye movements. In International Conference on Artificial Intelligence and Statistics (AISTATS '07).

HENDERSON, J. M. 2003. Human gaze control during real-world scene perception. Trends in Cognitive Sciences 7, 11, 498–504.

HYRSKYKARI, A., MAJARANTA, P., AALTONEN, A., AND RAIHA, K.-J. 2000. Design issues of iDict: A gaze-assisted translation aid. In Proceedings of ETRA 2000, Eye Tracking Research and Applications Symposium, ACM Press, 9–14.

ITTI, L., KOCH, C., AND NIEBUR, E. 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 11, 1254–1259.

KANDEMIR, M., SAARINEN, V.-M., AND KASKI, S. 2010. Inferring object relevance from gaze in dynamic scenes. In Short Paper Proceedings of ETRA 2010, Eye Tracking Research and Applications Symposium. To appear.

KLAMI, A., SAUNDERS, C., DE CAMPOS, T. E., AND KASKI, S. 2008. Can relevance of images be inferred from eye movements? In MIR '08: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, ACM, New York, NY, USA, 134–140.

KOZMA, L., KLAMI, A., AND KASKI, S. 2009. GaZIR: Gaze-based zooming interface for image retrieval. In Proc. ICMI-MLMI 2009, The Eleventh International Conference on Multimodal Interfaces and The Sixth Workshop on Machine Learning for Multimodal Interaction, ACM, New York, NY, USA, 305–312.

MCCULLAGH, P., AND NELDER, J. 1989. Generalized Linear Models. Chapman & Hall/CRC.

QVARFORDT, P., AND ZHAI, S. 2005. Conversing with the user based on eye-gaze patterns. In CHI '05: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, New York, NY, USA, 221–230.

TORRALBA, A., OLIVA, A., CASTELHANO, M. S., AND HENDERSON, J. M. 2006. Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review 113, 4, 766–786.

WARD, D. J., AND MACKAY, D. J. C. 2002. Fast hands-free writing by gaze direction. Nature 418, 6900, 838.

ZHANG, L., TONG, M. H., MARKS, T. K., SHAN, H., AND COTTRELL, G. W. 2008. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision 8, 7 (12), 1–20.
