

Copyright © 2010 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail [email protected]. ETRA 2010, Austin, TX, March 22 – 24, 2010. © 2010 ACM 978-1-60558-994-7/10/0003 $10.00

Learning Relevant Eye Movement Feature Spaces Across Users

Zakria Hussain∗
University College London

Kitsuchart Pasupa†
University of Southampton

John Shawe-Taylor‡
University College London

Abstract

In this paper we predict the relevance of images based on a low-dimensional feature space found using several users' eye movements. Each user is given an image-based search task, during which their eye movements are extracted using a Tobii eye tracker. The users also provide us with explicit feedback regarding the relevance of images. We demonstrate that by using a greedy Nyström algorithm on the eye movement features of different users, we can find a suitable low-dimensional feature space for learning. We validate the suitability of this feature space by projecting the eye movement features of a new user into this space, training an online learning algorithm using these features, and showing that the number of mistakes (regret over time) made in predicting relevant images is lower than when using the original eye movement features. We also plot Recall-Precision and ROC curves, and use a sign test to verify the statistical significance of our results.

CR Categories: G.3 [Probability and Statistics]: Nonparametric statistics, Time series analysis; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Relevance feedback

Keywords: Feature selection, Eye movement features, Online learning, Nyström method, Tobii eye tracker

1 Introduction

In this study we devise a novel strategy for generalising the eye movements of users, based on their viewing habits over images. We analyse data from a search task requiring a user to view images on a computer screen whilst their eye movements are recorded. The search task is defined as finding images most related to bicycles, horses or transport, and the user is required to indicate which images on each page are relevant to the task. Using these eye movement features we run an online learning algorithm [Cesa-Bianchi and Lugosi 2006], such as the perceptron, to see how well we can predict the relevant images. However, is it possible to improve the results of the perceptron by taking into account the behaviour of other users' eye movements? We present some preliminary results that answer this question in the affirmative.

The initial goal of this work was to use the Multi-Task Feature Learning (MTFL) algorithm of [Argyriou et al. 2008]. In order to use nonlinear kernels, however, the MTFL algorithm requires a low-dimensional mapping, via Gram-Schmidt, an eigen-decomposition, etc., in order to construct explicit features [Shawe-Taylor and Cristianini 2004]. Simply using these low-dimensional mappings without applying the MTFL algorithm also generates a common feature space.

∗Department of Computer Science, [email protected]
†School of Electronics and Computer Science, [email protected]
‡Department of Computer Science, [email protected]

Instead of the mappings mentioned above, we chose to employ the popular Nyström approximation [Williams and Seeger 2000] using a greedy algorithm called sparse kernel principal components analysis [Hussain and Shawe-Taylor 2009]. Given this low-dimensional representation we can project all new users into this space and run the perceptron algorithm. We show that learning in this refined feature space allows us to predict relevant images early in the search task, indicating that our low-dimensional feature space generalises eye movements across several different users.

The paper is organised as follows. Sub-sections 1.1 and 1.2 describe the background techniques. Section 2 describes our novel method in detail, and the experiments and conclusions are given in Sections 3 and 4, respectively.

1.1 Background I: Online learning

We are given a sequence of examples (inputs) x ∈ ℝⁿ in n-dimensional space. These examples will be the different eye movement features extracted from the eye tracker. Furthermore, each user indicates the relevance of images (i.e., outputs) y ∈ {−1, 1}, where 1 indicates relevance and −1 indicates irrelevance.¹ Therefore we concentrate on the binary classification problem. We will also be given T ∈ ℕ users, each of whom produces an m-sample of input-output pairs S_t = {(x_t1, y_t1), . . . , (x_tm, y_tm)}, where t ∈ {1, . . . , T}. Finally, given any m-sample S we would like to construct a classifier h_m : x ↦ {−1, +1} which maps inputs to outputs, where m indicates the number of examples used to construct the classifier h_m.

Given a classifier h_ℓ(·) we can measure the regret incurred so far by counting the number of times our classifier has made an incorrect prediction. More formally, given an (x, y) pair, we say a mistake is made if the prediction and the true label disagree, i.e., h_ℓ(x) ≠ y. Therefore, given x_1, . . . , x_ℓ and a classifier computed using ℓ inputs, we have the regret:

regret(ℓ) = Σ_{i=1}^{ℓ} I(h_ℓ(x_i) ≠ y_i)    (1)

where I(a) is equal to 1 if a is true and 0 otherwise. The regret can be thought of as the cumulative number of mistakes made over time.

The protocol used for online learning (see [Cesa-Bianchi and Lugosi 2006] for an introduction) is described below in Figure 1.

• Initialise regret(0) = 0
• Repeat for i = 1, 2, . . .
  1. receive input x_i and construct hypothesis h_i
  2. predict output h_i(x_i)
  3. receive real output y_i
  4. regret(i) = regret(i − 1) + I(h_i(x_i) ≠ y_i)

Figure 1: Online learning protocol.

The idea is to minimise the regret over time: we would like to make the least number of mistakes throughout the online learning protocol.

¹ In practice all images not clicked as relevant are deemed irrelevant.



We will show in this paper that we can leverage useful information from other tasks to construct a low-dimensional feature space, leading to our online learning algorithm having smaller regret than when applied to individual samples. We will use the perceptron algorithm [Cesa-Bianchi and Lugosi 2006] in our experiments.
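For concreteness, the protocol of Figure 1 with a perceptron can be sketched in a few lines. This is our own minimal illustration on synthetic stand-in data, not the paper's implementation; it accumulates the regret of Equation (1) round by round.

```python
import numpy as np

def online_perceptron(X, y):
    """Run the online protocol of Figure 1 with a linear perceptron.

    X: (m, n) array of inputs; y: (m,) array of labels in {-1, +1}.
    Returns the cumulative regret (mistake count) after each round.
    """
    m, n = X.shape
    w = np.zeros(n)                      # current hypothesis h_i(x) = sign(w.x)
    regret, mistakes = [], 0
    for i in range(m):
        pred = 1 if w @ X[i] >= 0 else -1   # step 2: predict
        if pred != y[i]:                    # steps 3-4: observe label, count mistake
            mistakes += 1
            w = w + y[i] * X[i]             # perceptron update on a mistake
        regret.append(mistakes)
    return regret
```

On separable data the regret curve flattens once the perceptron converges; the experiments later in the paper compare such curves across feature spaces.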

1.2 Background II: Low-dimensional mappings

The Gram-Schmidt orthogonalisation procedure finds an orthonormal basis for a matrix and is commonly used in machine learning when an explicit feature representation of inputs is required from a nonlinear kernel [Shawe-Taylor and Cristianini 2004]. However, Gram-Schmidt does not take into account the residual loss between the low-dimensional mapping and the original feature space, and so may not find a rich enough feature space.

A related method, known as the Nyström method, has been proposed and has recently received considerable attention in the machine learning community [Smola and Schölkopf 2000; Drineas and Mahoney 2005; Kumar et al. 2009]. As shown by [Hussain and Shawe-Taylor 2009], the greedy algorithm of [Smola and Schölkopf 2000] finds a low-dimensional space that minimises the residual error between the low-dimensional feature space and the original feature space. This corresponds to finding the directions of maximum variance in the data, and hence it can be viewed as a sparse kernel principal components analysis. Due to these advantages, we use the Nyström method in our experiments.

We will assume throughout that our examples x have already been mapped into a higher-dimensional feature space F using φ : ℝⁿ ↦ F. We therefore write φ(x) = x and make no further explicit reference to φ(x). If our input matrix X contains the features as column vectors, then we can construct a kernel matrix K = XᵀX ∈ ℝ^{m×m} using inner products, where ᵀ denotes the transpose. Furthermore, we define each kernel function as κ(x, x′) = xᵀx′.

Let K_i = [κ(x_1, x_i), . . . , κ(x_m, x_i)]ᵀ denote the ith column vector of the kernel matrix. Let i = (1, . . . , k), k ≤ m, be an index vector and let K_ii indicate the square matrix constructed using the indices i. The Nyström approximation K̃, which approximates the kernel matrix K, can be defined like so:

K̃ = K_i K_ii⁻¹ K_iᵀ.

Furthermore, to project into the space defined by the index vector i we follow the regime outlined by [Diethe et al. 2009]: taking the Cholesky decomposition of K_ii⁻¹ gives an upper triangular matrix R = chol(K_ii⁻¹) with RᵀR = K_ii⁻¹, with which we can make a projection into this low-dimensional space. Therefore, given any input x_j we can create the corresponding new input z_j using the following projection:

z_j = K_ji Rᵀ.    (2)
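These definitions are easy to verify numerically. Below is a small sketch (our own illustration: a linear kernel on random data and an arbitrary index vector, none of which come from the paper) confirming that the inner products of the projected points z_j reproduce the Nyström approximation K̃, and that the approximation is exact when the chosen columns span the data.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))            # m = 8 examples as columns
K = X.T @ X                            # linear kernel matrix (8 x 8)

i = np.array([0, 2, 4, 5, 7])          # index vector i, k = |i| = 5 (arbitrary)
Ki = K[:, i]                           # the columns K_i
Kii_inv = np.linalg.inv(K[np.ix_(i, i)])
K_tilde = Ki @ Kii_inv @ Ki.T          # Nystroem approximation of K

# R is upper triangular with R^T R = K_ii^{-1}; numpy returns the lower
# Cholesky factor, so we transpose it.
R = np.linalg.cholesky(Kii_inv).T
Z = Ki @ R.T                           # row j is the projection z_j = K_ji R^T

assert np.allclose(Z @ Z.T, K_tilde)   # projected inner products recover K_tilde
assert np.allclose(K_tilde, K)         # exact here: the k columns span the data
```

The second assertion holds because k equals the rank of K in this toy setup; with k below the rank, K̃ is only an approximation.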

We have now described all of the components needed to describe our algorithm. Note, however, that the projection of Equation (2) is usually only applied to a single sample S, whereas in our work we have T different samples S_1, . . . , S_T. We follow a methodology similar to that of [Argyriou et al. 2008] (who use Gram-Schmidt), using the Nyström method instead.

2 Our method

Given T − 1 input-output samples S_1, . . . , S_{T−1} we have X_t = (x_t1, . . . , x_tm)ᵀ where t = 1, . . . , T − 1. We concatenate these feature vectors into a larger matrix X = (X_1, . . . , X_{T−1}) and compute the corresponding kernel matrix K = XᵀX. This kernel matrix contains information from all T − 1 samples, and so we apply the sparse kernel principal components analysis algorithm [Hussain and Shawe-Taylor 2009] to find the index set i (where |i| = k < m) needed to compute the projection of Equation (2). This equates to finding a space of maximum variance across all of the different user behaviours. Given a new user's eye movement feature vectors X_T we can project them into this joint low-dimensional mapping and run any online learning algorithm in this space:

X_Tᵀ X_i Rᵀ

where R = chol(K_ii⁻¹). We choose an online learning algorithm due to the speed, simplicity and practicality of the approach. For instance, when presenting users with images in a Content-Based Image Retrieval (CBIR) system [Datta et al. 2008] we would like to retrieve relevant images as quickly as possible. Although we do not explicitly model a CBIR system, we maintain that a future development applying our techniques within CBIR systems could be useful in retrieving relevant images quickly. In the current paper we are more interested in the feasibility question of whether or not any important information can be gained from the eye movements of other users.

For practical purposes we randomly choose a subset of inputs from each user in order to construct X – in practice this should not affect the quality of the Nyström approximation [Smola and Schölkopf 2000]. However, for completeness we describe the algorithm without this randomness. The feature projection procedure is described below in Algorithm 1.

Algorithm 1: Nyström feature projection for multiple users
Input: training inputs X_1, . . . , X_{T−1}, new inputs X_T and k ≤ m ∈ ℕ
Output: new features z_T1, . . . , z_Tm
1: concatenate all feature vectors to get X = (X_1, . . . , X_{T−1})
2: create a kernel matrix K = XᵀX
3: run the sparse KPCA algorithm of [Hussain and Shawe-Taylor 2009] to find the index set i such that k = |i|
4: compute R = chol(K_ii⁻¹)
5: for j = 1, . . . , m do
6:   construct new feature vectors using the inputs in X_T:
       z_Tj = x_Tjᵀ X_i Rᵀ    (3)
7: end for

Furthermore, based on these feature vectors we also train the Multi-Task Feature Learning algorithm of [Argyriou et al. 2008] on the T − 1 samples, viewing each as a task. The algorithm produces a matrix D ∈ ℝ^{δ×δ} where δ < m; after taking the Cholesky decomposition R_D = chol(D), we have the low-dimensional feature mapping z_Tj ↦ z_Tj R_D. Unlike our Algorithm 1, this low-dimensional mapping also utilises information about the outputs.
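Algorithm 1 can be sketched end to end. The greedy step below uses the pivoted incomplete-Cholesky rule (pick the column with the largest residual diagonal, i.e. the direction of maximum remaining variance), which matches the sparse KPCA criterion attributed to [Hussain and Shawe-Taylor 2009]; this is our simplified stand-in with a linear kernel and synthetic data, not the authors' implementation.

```python
import numpy as np

def greedy_nystrom_indices(K, k):
    """Greedily pick k columns of kernel matrix K by largest residual
    diagonal (maximum remaining variance), via partial pivoted Cholesky."""
    m = K.shape[0]
    d = np.diag(K).astype(float).copy()   # residual variances
    G = np.zeros((m, k))                  # partial Cholesky factors
    idx = []
    for t in range(k):
        j = int(np.argmax(d))
        idx.append(j)
        g = (K[:, j] - G[:, :t] @ G[j, :t]) / np.sqrt(d[j])
        G[:, t] = g
        d = np.maximum(d - g ** 2, 0.0)   # pivot's residual drops to zero
    return np.array(idx)

def nystrom_project(train_users, X_new, k):
    """Algorithm 1 sketch: project a new user's inputs (columns of X_new)
    into the joint low-dimensional space built from the training users."""
    X = np.hstack(train_users)                # step 1: concatenate feature vectors
    K = X.T @ X                               # step 2: kernel matrix (linear kernel)
    i = greedy_nystrom_indices(K, k)          # step 3: sparse KPCA index set
    Kii_inv = np.linalg.inv(K[np.ix_(i, i)])
    R = np.linalg.cholesky(Kii_inv).T         # step 4: upper triangular factor
    return (X_new.T @ X[:, i]) @ R.T          # steps 5-7: rows z_Tj = x_Tj^T X_i R^T

# Toy usage: T-1 = 3 training users, one new user, n = 6 features, m = 10 inputs.
rng = np.random.default_rng(2)
users = [rng.normal(size=(6, 10)) for _ in range(3)]
X_T = rng.normal(size=(6, 10))
Z = nystrom_project(users, X_T, k=4)          # Z has shape (10, 4)
```

The online perceptron of Section 1.1 would then be run on the rows of Z rather than on the original 33-dimensional eye movement features.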

3 Experimental Setup

The database used for images was the PASCAL VOC 2007 challenge database [Everingham et al. ], which contains over 9000 images, each of which contains at least one object from a possible list



of 20 objects, such as cars, trains, cows and people. There were T = 23 users in the experiments; participants were asked to view twenty pages, each of which contained fifteen images. Their eye movements were recorded by a Tobii X120 eye tracker [Tobii Studio Ltd ] connected to a PC with a 19-inch monitor (resolution 1280x1024). The eye tracker has approximately 0.5 degrees of accuracy, a sample rate of 120 Hz, and uses an infrared lens to detect pupil centres and corneal reflection. Figure 2 shows an example page presented to a user along with the eye movements (in red).

Figure 2: An example page shown to a participant. Eye movements of the participant are shown in red. The red circles mark fixations, and the small red dots correspond to raw measurements that belong to those fixations. The black dots mark raw measurements that were not included in any of the fixations.

Each participant was asked to carry out one of three possible search tasks: looking for a Bicycle (8 users), a Horse (7 users) or some form of Transport (8 users). Participants marked as relevant any images within a search page that they thought matched their search.

The eye movement features extracted from the Tobii eye tracker can be found in Table 1 of [Pasupa et al. 2009] and were collected by [Saunders and Klami 2008]. Most of these are motivated by features considered in the literature for text retrieval studies; in addition, however, they also include more image-based eye movement features. These are computed for each full image, based on the eye trajectory and the locations of the images on the page. They cover the three main types of information typically considered in reading studies: fixations, regressions, and refixations. In total 33 features were extracted, computed from: (i) raw measurements from the eye tracker; and (ii) fixations estimated from the raw measurements using the standard ClearView fixation filter provided with the Tobii eye tracking software, with settings of "radius 50 pixels, minimum duration 100 ms". In the next subsection, these 33 features correspond to the "Original features". Finally, in order to evaluate our methods with the perceptron we use the Regret (see Equation (1)), the Recall-Precision (RP) curve and the Receiver Operating Characteristic (ROC) curve.
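The RP and ROC curves are summarised later by average precision and the area under the ROC curve. For concreteness, a small sketch of how both summaries can be computed from a score ranking (our own illustrative code, not tied to the paper's tooling):

```python
def average_precision(labels, scores):
    """AP: mean of the precision values at the ranks of relevant items,
    with items sorted by decreasing score. Labels are in {-1, +1}."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    hits, precisions = 0, []
    for rank, j in enumerate(order, start=1):
        if labels[j] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def auroc(labels, scores):
    """Area under the ROC curve via the rank statistic: the probability
    that a random relevant item outscores a random irrelevant one."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking (all relevant images scored above all irrelevant ones) gives AP = AUROC = 1.0; random scoring gives values near 0.5 for AUROC.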

3.1 Results

Given 23 users, we used the features of 22 users (i.e., leave-one-user-out) in order to compute our low-dimensional mapping using the Nyström method outlined in Algorithm 1 – referred to as the "Nyström features".² We also made a mapping using the Multi-Task Feature Learning algorithm of [Argyriou et al. 2008], as described at the end of Section 2, which we refer to as the "Multi-Task features". The baseline used the Original features extracted by the Tobii eye tracker, described above. The Original features involve no low-dimensional feature space and were simply used with the perceptron, without using information from other users' eye movements.

Figure 3 plots the regret, RP and ROC curves when training the perceptron algorithm with the Original features, the Nyström features and the Multi-Task features. We average each measure over the 23 leave-one-out users, and also over ten random splits of the k feature vectors needed to construct X. There is a small difference between the Multi-Task features and the Original features, in favour of the Multi-Task features. However, the Nyström features yield much superior results with respect to both. The regret of the perceptron using the Nyström features, shown in Figure 3(a), is always smaller than that of the Original and Multi-Task features (smaller regret is better). It is not surprising that we beat the Original features, but our improvement over the Multi-Task features is more surprising, since the latter also utilises the relevance feedback in order to construct a suitable low-dimensional feature space, whereas the Nyström method only uses the eye movement features. Figure 3(b) plots the Recall-Precision curve and shows dramatic improvements over the Original and Multi-Task features. This plot suggests that using the Nyström features allows us to retrieve more relevant images at the start of a search, which would be important for a CBIR system, for instance. Finally, the ROC curves pictured in Figure 3(c) also suggest improved performance when using the Nyström method to construct low-dimensional features across users.

Algorithm            Average Regret   AAP      AUROC
Original features    0.2886           0.5687   0.6983
Nyström features     0.2433           0.6540   0.7662
Multi-Task features  0.2752           0.5802   0.7107

Table 1: Summary statistics for the curves in Figure 3. The bold font indicates statistical significance at the level p < 0.01 over the baseline.
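The significance levels reported alongside Table 1 come from a sign test over the per-user differences. A minimal sketch of the two-sided test (our own illustration, computing the binomial tail exactly):

```python
from math import comb

def sign_test(differences):
    """Two-sided sign test: differences[i] is, e.g., the regret of method A
    minus that of method B for user i; ties (zeros) are discarded."""
    pos = sum(1 for d in differences if d > 0)
    neg = sum(1 for d in differences if d < 0)
    n, k = pos + neg, max(pos, neg)
    # P(X >= k) for X ~ Binomial(n, 1/2), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, if one method wins on 20 of 23 users, sign_test returns 2 · 2048/2²³ = 1/2048 ≈ 0.00049, well below the 0.01 level.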

In Table 1 we state summary statistics for the curves plotted in Figure 3. As mentioned earlier, we randomly select k features from each data matrix X_t in order to construct X, and repeat this for ten random seeds – the results in Table 1 average over these ten splits. For Figures 3(a), 3(b) and 3(c) we therefore report the Average Regret, the Average Average Precision (AAP), and the Area Under the ROC curve (AUROC), respectively. Using the Nyström features is clearly better than using the Original and Multi-Task features, with a significance level of p < 0.01. Using the Multi-Task features is slightly better than using the Original features, but those results are only significant at the level p < 0.05. The significance of the direction of the differences between each pair of measures was tested using the sign test [Siegel and Castellian 1988].

² For the plots we use "Nystroem".

[Figure 3: Averaged over 23 users: results of the perceptron algorithm using Original, Nyström and Multi-Task features. (a) The Regret; (b) The Recall-Precision curve; (c) The ROC curve.]

4 Conclusions

We took eye gaze data of several users viewing images for a search task, and constructed a novel method for producing a good low-dimensional space for eye movement features relevant to the search task. We employed the Nyström method and showed it to produce better results than the extracted eye movement features [Pasupa et al. 2009] and a state-of-the-art Multi-Task Feature Learning algorithm. The results of this paper indicate that if we would like to predict the relevance of images using the eye movements of a user, then it is possible to improve these results by learning relevant information from the eye movement behaviour of other users viewing similar (not necessarily the same) images and given similar (not necessarily the same) search tasks.

The first obvious application of our work is to CBIR systems. By recording the eye movements of users we could apply Algorithm 1 to project gaze features into a common feature space, and then apply a CBIR system in this space. Based on the results reported in this paper, we would hope that relevant images would be retrieved early in the search. Another application to CBIR would be to use the method we presented as a form of verification of relevance. Some users may give incorrect relevance feedback due to tiredness, lack of interest, etc., but based on our method we could correct such actions as and when users make judgements on relevance. Since the perceptron improves its classification over time, and since we would expect users to make mistakes near the end of a search, such corrective action would be useful to CBIR systems requiring accurate relevance feedback mechanisms to improve search.

The machine learning view is that we are projecting the feature vectors into a common Nyström space (constructed from the feature vectors of different users), and then running the perceptron with these features. Our experiments suggest better performance than simply using the perceptron together with the original features. We would therefore like to construct regret bounds for the perceptron when using this common low-dimensional space. Finally, for practical purposes we believe that randomly selecting columns from the kernel matrix to find i would generate a feature space [Kumar et al. 2009] with little deterioration in test accuracy, while leading to computational speed-ups.

Acknowledgements

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007–2013) under grant agreement n° 216529, Personal Information Navigator Adapting Through Viewing, PinView.

References

ARGYRIOU, A., EVGENIOU, T., AND PONTIL, M. 2008. Convex multi-task feature learning. Machine Learning 73, 3, 243–272.

CESA-BIANCHI, N., AND LUGOSI, G. 2006. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA.

DATTA, R., JOSHI, D., LI, J., AND WANG, J. Z. 2008. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys 40, 5:1–5:60.

DIETHE, T., HUSSAIN, Z., HARDOON, D. R., AND SHAWE-TAYLOR, J. 2009. Matching pursuit kernel fisher discriminant analysis. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, AISTATS, vol. 5 of JMLR: W&CP 5.

DRINEAS, P., AND MAHONEY, M. W. 2005. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research 6, 2153–2175.

EVERINGHAM, M., VAN GOOL, L., WILLIAMS, C. K. I., WINN, J., AND ZISSERMAN, A. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

HUSSAIN, Z., AND SHAWE-TAYLOR, J. 2009. Theory of matching pursuit. In Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds. 721–728.

KUMAR, S., MOHRI, M., AND TALWALKAR, A. 2009. Sampling techniques for the Nyström method. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, AISTATS, vol. 5 of JMLR: W&CP 5.

PASUPA, K., SAUNDERS, C., SZEDMAK, S., KLAMI, A., KASKI, S., AND GUNN, S. 2009. Learning to rank images from eye movements. In HCI '09: Proceedings of the IEEE 12th International Conference on Computer Vision (ICCV'09) Workshops on Human-Computer Interaction, 2009–2016.

SAUNDERS, C., AND KLAMI, A. 2008. Database of eye-movement recordings. Tech. Rep. Project Deliverable Report D8.3, PinView FP7-216529. Available online at http://www.pinview.eu/deliverables.php.

SHAWE-TAYLOR, J., AND CRISTIANINI, N. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, U.K.

SIEGEL, S., AND CASTELLIAN, N. J. 1988. Nonparametric statistics for the behavioral sciences. McGraw-Hill, Singapore.

SMOLA, A., AND SCHÖLKOPF, B. 2000. Sparse greedy matrix approximation for machine learning. In Proceedings of the 17th International Conference on Machine Learning, 911–918.

TOBII STUDIO LTD. Tobii Studio Help. http://studiohelp.tobii.com/StudioHelp_1.2/.

WILLIAMS, C. K. I., AND SEEGER, M. 2000. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, MIT Press, vol. 13, 682–688.
