
Copyright © 2010 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail permissions@acm.org. ETRA 2010, Austin, TX, March 22–24, 2010. © 2010 ACM 978-1-60558-994-7/10/0003 $10.00

Estimation of Viewer’s Response for Contextual Understanding of Tasks using Features of Eye-movements

Minoru Nakayama∗
CRADLE (The Center for R & D of Educational Technology)
Tokyo Institute of Technology

Yuko Hayashi
Human System Science
Tokyo Institute of Technology

∗e-mail: [email protected]

Abstract

To estimate viewer’s contextual understanding, features of their eye-movements while viewing question statements in response to definition statements, and features of correct and incorrect responses, were extracted and compared. Twelve directional features of eye-movements across a two-dimensional space were created, and these features were compared between correct and incorrect responses. The procedure of estimating the response was developed with Support Vector Machines, using these features. The estimation performance and accuracy were assessed across combinations of features. The number of definition statements, which needed to be memorized to answer the question statements during the experiment, affected the estimation accuracy. These results provide evidence that features of eye-movements during reading statements can be used as an index of contextual understanding.

CR Categories: H.1.2 [User/Machine Systems]: Human information processing; H.5.2 [User Interfaces]: Evaluation/methodology

Keywords: eye-movements, answer correctness, eye-movement metrics, user’s response estimation, discriminant analysis

1 Introduction

“Contextual understanding” is the awareness of knowledge and presented information, including texts, images and other factors. To discern a person’s contextual understanding of something, questions are generally given in order to observe the responses. Even in the human-computer interaction (HCI) environment, various systems ask users about their contextual understanding. To improve these systems and make them more environmentally effective for users and designers, an index for the measurement of contextual understanding is desirable, and should be developed to ferret out problems regarding HCI. Eye-movements can be used to evaluate document relevance [Puolamäki et al. 2005], and to estimate user certainty [Nakayama and Takahasi 2008]. Results such as these suggest the possibility that features of eye-movements can estimate viewer responses to questions which are based on contextual understanding and certainty. Estimation techniques using eye-movement metrics have already been applied to Web page assessments [Ehmke and Wilson 2007; Nakamichi et al. 2006]. The effective features for response estimation have not yet been determined, however.

To conduct an estimation of viewer’s responses using eye-movements, the appropriate features need to be extracted. Jacob & Karn summarized eye-tracking-related metrics and their effectiveness while subjects completed various tasks [Jacob and Karn 2003]. Additionally, Rayner summarized features of eye-movements in the reading process [Rayner 1998]. Most metrics are based on fixation and saccade during a specific task, and are scalar, not dimensional. Therefore, high-level eye-movement metrics are required, and some have already been proposed [Duchowski 2006]. Features of eye-movement metrics across two-dimensional space have also been discussed, as eye-movements can be illustrated in two dimensions [Tatler 2007; Tatler et al. 2007]. The authors’ preliminary analysis of the inferential task suggests that estimations using several features and a linear function are useful, but that subjects’ performance is not sufficient to be analyzed in depth [Nakayama and Hayashi 2009]. To improve this performance, two approaches are considered. The first is the creation of eye-movement features which are effective for understanding the viewer’s behavior. The second is a more robust estimation procedure using non-linear functions such as Support Vector Machines (SVM) [Stork et al. 2001]. In addition, performance assessment procedures are used to emphasize the significance of the estimation [Fawcett 2006].

This paper addresses the feasibility of estimating user response correctness in inferential tasks, using various features of eye-movements made while the user selects alternative choices based on their contextual understanding of statements presented to them.

2 Experimental method

The subjects were first asked to understand and memorize some definition statements which described locational relationships between two objects (Figure 1(a)). Each definition statement was presented for 5 seconds. Then, ten questions in statement form were given to determine the degree of understanding. These questions asked subjects to choose, as quickly as possible, whether each question statement was “Yes (True)” or “No (False)” (Figure 1(b)). Each question statement was shown for 10 seconds. When the subject responded to a question statement, the display moved to the next task. All texts were written in Japanese Kanji and Hiragana characters and were read from left to right. This task asked subjects for “contextual understanding”. The number of definition statements given to subjects was controlled at 3, 5 or 7 per set, to set the task difficulty. Five sets were created for each statement level. In total, 150 responses per subject were gathered (3 levels × 5 sets × 10 questions). The experimental sequence was randomized to prevent subjects from experiencing any learning effect. The subjects were 6 male university students ranging from 23 to 33 years of age. They had normal visual acuity for this experiment.

Figure 1: Left side (a) shows a sample of a definition statement: “A theater is located on the east side of the police station.” Right side (b) shows a sample of a question statement: “There is a post office on the south side of the theater.”

The task was displayed on a 20-inch LCD monitor positioned 60 cm from the subject. During the experiment, a subject’s eye-movements were observed using a video-based eye tracker (nac:EMR-8NL). The subject rested his head on a chin rest, and a small infra-red camera was positioned between the subject and the monitor, 40 cm from the subject. Blinks were detected using the aspect ratio of the two diameters. During blinks, the missing eye-tracking data was compensated for by the use of a simple, previously used procedure [Nakayama and Shimizu 2004]. The tracker was calibrated at the beginning of the session, and eye-movement was tracked on a 640 by 480 pixel screen at 60 Hz. The spatial resolution of this equipment is noted in the manufacturer’s catalog as a visual angle of 0.1 degrees.
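To make the blink handling concrete, the following is a minimal sketch of one plausible implementation, assuming the tracker provides per-sample major and minor pupil diameters; the threshold value and the linear interpolation are illustrative assumptions, not the cited procedure itself.

```python
import numpy as np

def interpolate_blinks(gaze, d_major, d_minor, ratio_threshold=0.6):
    """gaze: (N, 2) positions; d_major/d_minor: (N,) pupil diameters.
    Samples whose diameter aspect ratio falls below the (assumed)
    threshold are treated as blinks and replaced by interpolation,
    standing in for the procedure cited as [Nakayama and Shimizu 2004]."""
    blink = (d_minor / np.maximum(d_major, 1e-9)) < ratio_threshold
    idx = np.arange(len(gaze))
    out = gaze.copy()
    for k in range(2):  # interpolate x and y coordinates separately
        out[blink, k] = np.interp(idx[blink], idx[~blink], gaze[~blink, k])
    return out
```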

The tracked eye-movement data was extracted for the duration of time subjects viewed each question statement, before the mouse button was pressed. While differences between the captured viewing positions were calculated, eye-movements were divided into saccades and gazes using a threshold of 40 degrees per second [Ebisawa and Sugiura 1998]. The two-dimensional distribution of eye-movements has not previously been considered as a source of eye-movement features, although several studies have used these factors in their research [Tatler 2007; Tatler et al. 2007]. Therefore, features of fixation and saccade were mapped in 12 directions using 30-degree steps, in order to represent a two-dimensional distribution. Four types of features were summarized for each time the statements were viewed, as follows: the fixation position as the distance in degrees from the center of the screen, fixation duration, saccade length, and saccade duration. These are 12-dimensional vectors, and they can also be noted as a scalar of the means of the components. A sketch of this feature extraction follows.
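The sketch below illustrates the 40 deg/s saccade threshold and the 12-direction, 30-degree binning described above; the data layout, function names, and bin orientation are assumptions, not the authors' code.

```python
import numpy as np

SAMPLE_RATE = 60.0          # Hz, as reported for the EMR-8NL tracker
VELOCITY_THRESHOLD = 40.0   # deg/s: faster steps are saccades, slower are gazes

def saccade_mask(gaze):
    """gaze: (N, 2) positions in degrees from screen center.
    Returns a boolean mask over the N-1 inter-sample steps,
    True where the step belongs to a saccade."""
    velocity = np.linalg.norm(np.diff(gaze, axis=0), axis=1) * SAMPLE_RATE
    return velocity > VELOCITY_THRESHOLD

def direction_bin(vector):
    """Map a 2-D position or displacement vector to one of 12 bins of
    30 degrees each (bin 0 centered on 0 deg, bin 3 on 90 deg, ...)."""
    angle = np.degrees(np.arctan2(vector[1], vector[0])) % 360.0
    return int(((angle + 15.0) % 360.0) // 30.0)

def directional_means(vectors, values):
    """Accumulate scalar values (e.g. fixation distance from center,
    fixation duration, saccade length, saccade duration) into a
    12-dimensional mean vector by direction."""
    sums, counts = np.zeros(12), np.zeros(12)
    for vec, val in zip(vectors, values):
        b = direction_bin(vec)
        sums[b] += val
        counts[b] += 1
    return np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
```

The scalar form of each feature mentioned in the text would then simply be the mean of the corresponding 12-dimensional vector's components.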

3 Results

3.1 Viewer’s response

The subjects’ responses were classified as correct (hits and correct rejections) or incorrect (misses and false alarms) according to the context of the question statement. There was a unique answer for every question, because the question statements were generated using a rule of logic. The reaction time was also measured in milliseconds for all responses. The accuracy of the responses across the number of statements is summarized in Figure 2. The accuracy decreases with the total number of statements. According to the results of a one-way ANOVA on the accuracy, the factor of the number of statements is significant (F(2, 10) = 16.5, p < 0.01). The accuracy for a total of 3 statements is significantly higher than for the others (p < 0.01), but there is no significant difference between the accuracy for 5 and 7 statements (p = 0.21). This suggests that the number of statements can be used to control the difficulty of the task, and that the task is easiest for 3 statements and equally hard for 5 and 7 statements. Mean reaction times for both correct and incorrect responses are also illustrated in Figure 2. There are significant differences in reaction times between correct and incorrect responses (F(1, 5) = 109.1, p < 0.01). The factor of the number of statements is not significant (F(2, 20) = 0.6, p = 0.54). This suggests that reaction time is a key factor of response correctness.

Figure 2: Mean accuracy and reaction times for correct and incorrect responses across the number of definition statements. (Accuracy plotted from 0 to 1, reaction time from 0 to 4 sec.; conditions are 3, 5 and 7 statements.)

Figure 3: Mean fixation position for 12 directions. (Polar plot in 30-degree steps from 0 to 330 degrees, radial scale 6 deg, for correct and incorrect responses; * marks directions with p < 0.01.)

3.2 Feature differences between responses

Extracted features of eye movements for question statements (saccade length, differences in saccade length, saccade frequency and saccade duration) were compared between correct and incorrect responses. The distribution of fixation points is illustrated in Figure 3. The figure shows fixation points covering a horizontal area, in particular on the right-hand side. In this experiment, single sentences were written horizontally, so subjects viewed them according to this outline. In Japanese, verbs and negations are written at the ends of sentences, so readers may confirm there the relationship between the subject and object in the sentence, and whether it is a positive or negative statement. When comparing positions between correct and incorrect responses, significant differences are marked with an asterisk (*) on the axis (p < 0.01). There are significant differences between the 150 to 330 degree directions and in the 60 degree direction. For all of these cases, mean positions for incorrect responses were longer than those for correct responses. As most differences appear on the left-hand side, this suggests that subjects’ fixation points stayed at the beginning of a statement when they made an incorrect response; they might have had some trouble starting to read. Subjects also viewed a wider area when they made incorrect responses than when they made correct responses. The distribution of the fixation durations for each direction is similar to that of the mean positions from the center in Figure 3. The durations on the right-hand side are relatively longer than for the other directions. This means that viewers’ eye movements stayed in this area, at the end of the sentence. When comparing the durations between correct and incorrect responses, the results for the far right-hand side differ from the others. Mean durations for correct responses are significantly longer than durations for incorrect responses in the direction of 0 degrees. For other directions, such as upward, leftward and in the direction of 300 degrees, mean durations for incorrect responses are longer than mean durations for correct responses. This means that the distribution of the durations for correct responses shifts towards the right. These metrics provide some of the required indices, such as the visual area of coverage and visual attention [Duchowski 2006].


Figure 4: Mean saccade length across 12 directions. (Polar plot in 30-degree steps from 0 to 330 degrees, radial scale 5 deg., for correct and incorrect responses; * marks directions with p < 0.01.)


Mean saccade lengths in visual angles are summarized across the 12 directions in Figure 4. The mean saccade lengths are spread more widely along the horizontal axis. In particular, saccade lengths in the horizontally opposite directions are the longest, and their lengths are almost equal. This behavior shows that subjects carefully read the question statements. It may also depend on Japanese grammar, because the subject term and the verb are separated horizontally in the text. When comparing the lengths between correct and incorrect responses, the mean saccade length in the reverse direction (180 degrees) is longer for correct responses than it is for incorrect responses. For several other directions (0, 90, 150, 210, 300, and 330 degrees), however, the mean saccade lengths for incorrect responses are longer than those for correct responses. Mean saccade durations for incorrect responses are clearly longer than those for correct responses. Though the mean indicates the duration of a single saccade, the overall means are quite different, and there are significant differences in all directions between correct and incorrect responses. This suggests that saccadic movement is slower when the viewer’s response is incorrect.

3.3 Estimation of Answer Correctness

The significant differences in eye-movement features between responses were summarized in the sections above. These results suggest that viewers’ responses might be estimated from their eye-movement patterns before their decisions are made, and this possibility should be determined. Here, the hypothesis is that there is a relationship between “correct” or “incorrect” responses and the acquired features of eye-movements for a question statement. Feature vectors of eye-movements are noted as V, alternative responses are noted as t, and the acquired data can be noted as (V, t) for each question statement. In this section, the performance of the estimation is determined using various features of eye movements. First of all, all extracted features (24+24 dimensions, V24+24) of eye movements, covering fixation and saccade, are applied to a discrimination function. Support vector machines (SVM) are used as the estimation function for this analysis, because SVM is quite robust for high-dimensionality features and poorly defined feature fields [Stork et al. 2001]. Here, a sign function G, based on the SVM decision function with a Gaussian kernel, is defined. The parameters and functions can be noted as follows:

t̂ = G(V24+24),   t̂, t ∈ {+1 (correct), −1 (incorrect)}
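The paper does not spell out G; for reference, the standard form of such a Gaussian-kernel SVM decision function is

\[
G(V) = \operatorname{sign}\!\Bigl(\sum_{i} \alpha_i t_i \, K(V_i, V) + b\Bigr),
\qquad
K(V_i, V) = \exp\!\Bigl(-\frac{\lVert V_i - V \rVert^2}{2\gamma^2}\Bigr),
\]

where the sum runs over the support vectors V_i with labels t_i, and the weights α_i ≥ 0 and bias b are learned from the training data. Here γ is written as the kernel's standard deviation, matching the text; note that LIBSVM itself parameterizes the exponent as a single coefficient instead.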

The optimization of the function G(V) was conducted using the LIBSVM tools [Chang and Lin 2008]. For the SVM, the penalty parameter C of the error term for the soft margin, and the parameter γ, the standard deviation of the Gaussian kernel, should be optimized. To extract validation results, the leave-one-out procedure was applied to the estimation: training data consisting of the data of all subjects except the targeted subject was prepared, both the training and the estimation of responses were then conducted, the results were tallied, and the mean performance was evaluated. The estimation results for three statements, for all features (t vs. t̂), are summarized in Table 1. The rate of correct decisions, consisting of hits and correct rejections, is 76.3%. This discrimination performance is significant according to the binomial distribution. Estimation performance is often evaluated using an ROC (Receiver Operating Characteristic) curve, which is based on signal detection theory [Fawcett 2006]. The LIBSVM tools can provide a probability for the discrimination [Chang and Lin 2008], and ROCs were created for each level of statements using this probability [Fawcett 2006]. Furthermore, the validation performance of the discrimination was assessed using the AUC (Area Under the Curve). The AUC varies between 0 and 1, and performance is better as the value approaches 1; when the AUC is near 0.5, performance is at the chance level. Other feature sets of eye-movements, such as fixations and saccades, were applied to the same estimation procedure. Estimation performances and AUCs across the number of statements are summarized in Table 2. The 12 features for fixations consist of the fixation positions across 12 directions; the 13 features consist of the 12 fixation positions plus a scalar of the fixation duration; and the 24 features consist of the 12 fixation positions and the 12 fixation durations. For saccades, the 12 features consist of the saccade lengths across 12 directions; the 13 features consist of the 12 saccade lengths plus a scalar of the saccade duration; and the 24 features consist of the 12 saccade lengths and the 12 saccade durations. As references, performances using combinations of scalar features were calculated. Combination “A” shows the performance when selected features (four features of saccades) are applied to the estimation [Nakayama and Hayashi 2009]. Combination “B” shows the performance when another set of selected features for all saccades (four features) [Nakayama and Takahasi 2008] is applied. The estimation procedure is based on the previous study. Neither feature set “A” nor “B” includes the reaction time factor in this paper, because that factor affected the performance.

Table 1: Discrimination results for one condition (f(24)s(24); number of statements = 3).

Subject’s        Estimation [t̂]
response [t]     Correct    Incorrect    Total
Correct          158        49           207
Incorrect         25        68            93
Total            183        117          300
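As an illustration of this validation scheme, the following is a minimal sketch of a leave-one-subject-out loop with an RBF-kernel SVM and AUC scoring. It assumes a feature matrix X (one f(24)s(24) row per question statement), labels y in {+1, −1}, and a parallel array of subject IDs, and it uses scikit-learn's libsvm-backed SVC rather than the original LIBSVM command-line tools; the default C and gamma are placeholders for the grid search described in the text.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def leave_one_subject_out(X, y, subjects, C=1.0, gamma="scale"):
    """Train on all subjects except one, test on the held-out subject,
    and average accuracy and AUC over the held-out subjects."""
    accs, aucs = [], []
    for s in np.unique(subjects):
        train, test = subjects != s, subjects == s
        clf = SVC(C=C, gamma=gamma, kernel="rbf", probability=True)
        clf.fit(X[train], y[train])
        accs.append(clf.score(X[test], y[test]))
        # Probability of the positive (+1, "correct") class, for the ROC/AUC.
        p_pos = clf.predict_proba(X[test])[:, list(clf.classes_).index(1)]
        aucs.append(roc_auc_score(y[test], p_pos))
    return np.mean(accs), np.mean(aucs)
```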

As a result of the estimations in Table 2, the best performance is obtained using all features of fixation and saccade across 12 directions. According to the table, the estimation performance using saccade features is higher than the performance using fixation features. When the estimation was conducted using fixation or saccade features alone, the performance was practically independent of the number of feature dimensions. A combination of fixation and saccade features gives the highest performance. When the two sets of selected scalar features “A” and “B” were applied to the estimation, the performance was not significant; a few estimations were less than 50% accurate. This result provides evidence that the directional information across 12 directions is quite significant. The AUC metrics are also highest when the estimation was conducted using combined fixation and saccade features. Table 1 shows the rate of false alarms, that is, incorrect responses which were estimated as correct; when the number of false alarms is smaller, the AUCs are higher. Additionally, both the estimation accuracy and the AUC metric for the three-statement condition are higher than for the conditions with 5 and 7 statements. According to Figure 2, the response accuracy for three statements is significantly higher than for the other conditions. This response accuracy may affect both the estimation accuracy and the AUCs.


Table 2: Estimation accuracy using feature vectors.

            Combination    Fixation                 Saccade                  Fixation+Saccade
Tasks       A      B       f(12)   f(13)   f(24)    s(12)   s(13)   s(24)    f(12)s(12)  f(12)s(13)  f(24)s(24)

Estimation accuracy (%)
3           68.3   43.0    65.0    65.7    73.0     77.3    78.7    76.0     76.0        75.7        76.3
5           50.3   55.7    68.3    68.7    65.3     65.7    65.3    64.0     66.7        68.7        70.7
7           44.3   61.0    67.6    67.0    61.6     67.0    66.0    66.7     68.3        67.7        68.7
M           54.3   53.2    67.0    67.1    66.6     70.0    70.0    68.9     70.3        70.7        71.9

AUC: Area under the curve
3           0.37   0.75    0.72    0.73    0.81     0.80    0.82    0.80     0.83        0.84        0.83
5           0.44   0.70    0.74    0.73    0.73     0.70    0.71    0.71     0.75        0.76        0.76
7           0.42   0.71    0.75    0.73    0.71     0.75    0.73    0.73     0.77        0.78        0.78
M           0.41   0.72    0.73    0.73    0.75     0.75    0.75    0.75     0.78        0.79        0.79

(13): 12 features of vector information plus the scalar of the duration.
A: mean saccade length, mean differential ratio, saccade frequency, mean saccade duration [Nakayama and Hayashi 2009]
B: saccade length, dx, dy, saccade duration of every saccade [Nakayama and Takahasi 2008]


4 Summary

To estimate viewer’s contextual understanding using features of eye-movements, features were extracted and compared between correct and incorrect responses when alternative responses to question statements concerning several definition statements were offered. Twelve directional features of eye-movements across a two-dimensional space were created: fixation position, fixation duration, saccade length and saccade duration. In a comparison of these features between correct and incorrect responses, there were significant differences in most features. This shows evidence that features of eye-movements reflect the viewer’s contextual understanding. An estimation procedure using Support Vector Machines was developed and applied to the experimental data. The estimation performance and accuracy were assessed across several combinations of features. When all extracted features of eye-movements were applied to the estimation, the estimation accuracy was 71.9% and the AUC was 0.79. The number of definition statements affected estimation performance and accuracy.

References

CHANG, C., AND LIN, C. 2008. LIBSVM: A library for support vector machines (last updated: May 13, 2008). Available 21 July 2009 at URL: http://www.csie.ntu.edu.tw/~cjlin/libsvm.

DUCHOWSKI, A. T. 2006. High-level eye movement metrics in the usability context. Position paper, CHI 2006 Workshop: Getting a Measure of Satisfaction from Eyetracking in Practice.

EBISAWA, Y., AND SUGIURA, M. 1998. Influences of target and fixation point conditions on characteristics of visually guided voluntary saccade. The Journal of the Institute of Image Information and Television Engineers 52, 11, 1730–1737.

EHMKE, C., AND WILSON, S. 2007. Identifying web usability problems from eye-tracking. In Proceedings of HCI 2007, British Computer Society, L. Ball, M. Sasse, C. Sas, T. Ormerod, A. Dix, P. Bagnall, and T. McEwan, Eds.

FAWCETT, T. 2006. An introduction to ROC analysis. Pattern Recognition Letters 27, 861–874.

JACOB, R. J. K., AND KARN, K. S. 2003. Eye tracking in human–computer interaction and usability research: Ready to deliver the promises. In The Mind’s Eye: Cognitive and Applied Aspects of Eye Movement Research, Hyönä, Radach, and Deubel, Eds. Elsevier Science BV, Oxford, UK.

NAKAMICHI, N., SHIMA, K., SAKAI, M., AND MATSUMOTO, K. 2006. Detecting low usability web pages using quantitative data of users’ behavior. In Proceedings of the 28th International Conference on Software Engineering (ICSE’06), ACM Press.

NAKAYAMA, M., AND HAYASHI, Y. 2009. Feasibility study for the use of eye-movements in estimation of answer correctness. In Proceedings of COGAIN 2009, A. Villanueva, J. P. Hansen, and B. K. Ersboell, Eds., 71–75.

NAKAYAMA, M., AND SHIMIZU, Y. 2004. Frequency analysis of task evoked pupillary response and eye-movement. In Eye Tracking Research and Applications Symposium 2002, ACM Press, New York, USA, S. N. Spencer, Ed., ACM, 71–76.

NAKAYAMA, M., AND TAKAHASI, Y. 2008. Estimation of certainty for responses to multiple-choice questionnaires using eye movements. ACM TOMCCAP 5, 2, Article 14.

PUOLAMÄKI, K., SALOJÄRVI, J., SAVIA, E., SIMOLA, J., AND KASKI, S. 2005. Combining eye movements and collaborative filtering for proactive information retrieval. In Proceedings of ACM SIGIR 2005, ACM Press, New York, USA, A. Heikkilä, A. Pietikäinen, and O. Silvén, Eds., ACM, 145–153.

RAYNER, K. 1998. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin 124, 3, 372–422.

STORK, D. G., DUDA, R. O., AND HART, P. E. 2001. Pattern Classification, 2nd ed. John Wiley & Sons, Inc. Japanese translation by M. Onoue, New Technology Communications Co., Ltd., Tokyo, Japan (2001).

TATLER, B. W., WADE, N. J., AND KAULARD, K. 2007. Examining art: dissociating pattern and perceptual influences on oculomotor behaviour. Spatial Vision 21, 1-2, 165–184.

TATLER, B. W. 2007. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision 7, 14, 1–17.
