
Looking at Faces: Autonomous Perspective Invariant Facial Gaze Analysis

Justin K. Bennett∗ Srinivas Sridharan Brendan John Reynold Bailey

Rochester Institute of Technology

Abstract

Eye-tracking provides a mechanism for researchers to monitor where subjects deploy their visual attention. Eye-tracking has been used to gain insights into how humans scrutinize faces; however, the majority of these studies were conducted using desktop-mounted eye-trackers where the subject sits and views a screen during the experiment. The stimuli in these experiments are typically photographs or videos of human faces. In this paper we present a novel approach using head-mounted eye-trackers which allows for automatic generation of gaze statistics for tasks performed in real-world environments. We use a trained hierarchy of Haar cascade classifiers to automatically detect and segment faces in the eye-tracker's scene camera video. We can then determine if fixations fall within the bounds of the face or other possible regions of interest and report relevant gaze statistics. Our method is easily adaptable to any feature-trained cascade to allow for rapid object detection and tracking. We compare our results with previous research on the perception of faces in social environments. We also explore correlations between gaze and confidence levels measured during a mock interview experiment.

Keywords: Face detection and tracking, eye-tracking, gaze statistics

Concepts: • Computing methodologies → Object detection; Tracking;

1 Introduction

In this paper we present a novel framework to automatically generate gaze statistics from real-world environments. Eye-tracking studies are conducted in many fields including market research, human factors and ergonomics, virtual reality and gaming, sports training, and psychological and medical research. For this paper we focus on the use of eye-tracking to analyze how humans scrutinize the faces of others. Previous studies in this area have typically been conducted using desktop-mounted eye trackers where the subject is positioned in front of a flat screen and shown images or video of human faces while their eye movements are tracked. Furthermore, many of these studies use a chin rest to minimize head movement, or the subject is instructed to remain as still as possible during the experiment, neither of which is natural. It is often difficult to verify whether conclusions drawn from these controlled laboratory experiments are applicable in real-world settings.

The emergence of head-mounted eye trackers has allowed researchers to capture information about visual behavior and perceptual strategies of people engaged in tasks outside of the laboratory. Today, most commercial eye-tracking companies offer head-mounted eye-tracking solutions. We utilize lightweight eye-tracking glasses which are equipped with a front-facing camera (scene camera) that captures the scene that the viewer is looking at, as well as two rear-facing cameras that capture binocular eye movements (eye cameras). Unlike desktop eye-trackers, which are stationary relative to the flat display on which the stimuli are presented, head-mounted eye-trackers allow the viewer to freely move about the 3D environment; hence both the perspective and the stimuli will differ across subjects. This freedom comes at a price, as a significant amount of labor-intensive manual effort is required to identify and annotate regions of interest (ROIs) as bounding boxes in each frame of each subject's scene camera video (approximately an hour for a one-minute video). Software tools to facilitate this process are gradually emerging but are still very much in their infancy. For example, although extensive work has been done in the computer vision community to develop object detection and tracking techniques, little has been done to utilize these advances for gaze analysis on head-mounted eye-tracking systems.

∗e-mail: [email protected]
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. © 2016 ACM. SAP '16, July 22-23, 2016, Anaheim, CA, USA. ISBN: 978-1-4503-4383-1/16/07. DOI: http://dx.doi.org/10.1145/2931002.2931005

Figure 1: Mock interview experiment setup. A - the interviewee. B - eye-tracking glasses. C - the interviewer (not wearing eye-tracking glasses).

We present a framework which leverages object detection and tracking techniques to facilitate automated analysis of gaze data in real-world environments. We demonstrate our framework within the context of facial analysis. We use a trained hierarchy of Haar cascade classifiers to automatically detect and segment faces in the eye-tracker's scene camera video. Once a face is detected, it is tracked from frame to frame, which enables the automatic collection of fixation data within the face boundary along with finer-grained analysis of regions within the face such as the left/right eyes, nose, mouth, etc. We evaluate the effectiveness of our system by comparing our results with previous research on the perception of faces in social environments. We also explore correlations between gaze and confidence levels measured during a mock interview experiment in which interviewees wearing eye-tracking glasses interact with both male and female interviewers.


Figure 2: System architecture and data flow.

Our framework can be easily adapted to work with any trained cascade of classifiers to automatically detect and track objects in the scene camera video. We recognize that even the best detection algorithms will occasionally fail. In such cases, we provide a simple mechanism for the user to manually specify the region of interest using the mouse. We then attempt to track it in the subsequent frames.
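A minimal sketch of such a manual fallback is shown below, using OpenCV's built-in ROI selector. The paper does not name the specific mechanism used, so the function and window name here are assumptions for illustration only.

```python
# Illustrative sketch (not the authors' exact code): let the user drag a
# bounding box around the missed region of interest, then hand that box
# to the tracker for the subsequent frames.
import cv2

def manual_roi_fallback(frame):
    """Ask the user to draw a rectangle; returns (x, y, w, h) or None."""
    box = cv2.selectROI("Annotate missed ROI", frame, showCrosshair=True)
    cv2.destroyWindow("Annotate missed ROI")
    x, y, w, h = box
    return (x, y, w, h) if w > 0 and h > 0 else None
```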

The remainder of this paper is organized as follows: background and related work are presented in Section 2, the system design is presented in Section 3, the design of our mock interview experiment is presented in Section 4, analysis and discussion of the experimental results are presented in Section 5, and the paper concludes in Section 6 with a summary of the contributions and potential avenues of future research.

2 Background & Related Work

Central (foveal) vision has very high acuity when compared to peripheral vision. This variation in visual acuity is due in part to the distribution of photoreceptors in the retina. The fovea has a diameter of less than 1 mm and subtends an angle of approximately 2 degrees of the visual field. This means that at any instant, less than 0.05% of our field of view is seen in high resolution. We overcome this limitation by rapidly scanning the scene with eye movements called saccades to sample our surroundings. Humans make approximately 10,000 saccades per hour, separated by brief pauses called fixations to focus on objects within the scene. Eye-tracking provides a mechanism for monitoring where high-acuity vision is focused in a scene or on a display. Modern eye-tracking systems are video-based: a video feed of the subject's eye is analyzed to determine the center of the pupil relative to some stable feature in the video (such as the corneal surface reflection). A brief calibration procedure is necessary to establish the mapping of the subject's eye position to locations within the scene. Wearable eye-trackers began to emerge during the late 1990s and have already been used in a variety of scenarios such as driving [Sodhi et al. 2002], athletics [Chajka et al. 2006], and the monitoring of mental health [Vidal et al. 2012].

Previous studies have indicated that eye movement patterns are strongly influenced by the viewer's intent or task-at-hand [Henderson and Hollingworth 1998; Tatler et al. 2010]. Humans are also naturally drawn to regions of a stimulus that exhibit high local contrast or edge density [Mannan et al. 1996], and to regions that contain meaningful information such as faces [Mackworth and Morandi 1967]. Finally, social context [Lansing and McConkie 1999; Lansing and McConkie 2003] as well as environmental factors such as noise [Vatikiotis-Bateson et al. 1998] have been found to play a role in fixation distribution.

Eye-tracking has been previously utilized to study how humans scrutinize the faces of others. Henderson et al. [2005] studied eye movements during free-viewing and restricted viewing conditions while learning new human faces. Their results indicate that more time is allocated to studying the eyes and nose regions when a subject is presented with a face they have not seen before. Several studies have revealed an attention bias towards the left side of the face when subjects are tasked with determining facial attractiveness, age, gender, and emotional expression. Burt et al. [1997] found that after exposing subjects to facial images that were modified to be perfectly symmetrical, their perceptual judgments were biased by the left side (viewer's perspective) of the face. This concurs with studies that used precise 3D measurements to determine that the left side of the face is more expressive during displays of emotion [Nicholls et al. 2004]. However, when subjects were tasked with lip reading, perception was dominated by information provided on the right side of the face. A possible explanation for this was proposed by Campbell et al. [1982], who provided evidence that the right side of the mouth moves significantly more than the left when a subject is speaking. Research suggests that these preferential gaze asymmetries may be attributed to the influential role of the brain's right hemisphere in facial identification and processing of visual information [Gilbert and Bakan 1973], perception of facial expressions [Christman and Hackworth 1993; Rhodes 1993; Schiff and Truchon 1993], and gender recognition [Luh et al. 1994]. The cultural background of the observer has also been shown to influence how gaze is distributed on faces [Michel et al. 2006].

Figure 3: Scene camera frame with a red bounding box indicating the tracked face ROI, blue rectangles indicating the eye regions, and a green rectangle surrounding the mouth. Subject gaze is visualized using a yellow cross-hair.

Additional eye-tracking experiments have been conducted on subjects with Autism Spectrum Disorder (ASD) to assess their ability to recognize human facial emotions [Bal et al. 2010]. The results of these studies are still the subject of active debate, with some researchers providing evidence that children with ASD have fixation patterns similar to those of typically developing children [Bar-Haim et al. 2006; Van der Geest et al. 2002], while other studies provide evidence to the contrary [Dalton et al. 2005; Klin et al. 2002; Pelphrey et al. 2002].

In order to resolve this debate, Bal et al. [2010] suggest that future studies be conducted with more robust data collection methods to accommodate lower-functioning children who typically do not sit still during experiments. Head-mounted eye-trackers present a viable solution to this problem. Motivated by these observations, the goal of this work is to construct an expandable framework that allows for automatic gaze analysis from head-mounted eye-trackers in real-world environments.

3 System Design

Figure 2 provides an overview of the system architecture and data flow. The eye-tracker used in our experiment is a pair of SensoMotoric Instruments (SMI) Eye Tracking Glasses. The resolution of the eye-tracker's scene camera is 960 x 720 at 30 fps and the field of view is 60° horizontal and 46° vertical. The gaze positions are computed using SMI's proprietary software at 30 Hz. Note that our framework is independent of the eye-tracker and as such can be utilized with any wearable eye-tracking device.

Object detection, in our case face detection, is performed on individual frames from the eye-tracker. Once the target object is detected, it is tracked in subsequent frames. If either detection or tracking fails, the frame is discarded or presented to the user for manual annotation. Finally, gaze statistics are automatically computed from the fixation points that fall within the tracked object region.

3.1 Object Detection

For robust object detection we used the method proposed by Viola and Jones [2001]. A classifier is trained with example images containing the target object (positive images) and examples without the object (negative images). The training images are convolved using feature windows and Haar-like features are calculated. The differences in the values of these regions are learned to separate objects from non-objects in images.

A cascade of classifiers was introduced to speed up object detection by eliminating non-target regions from further examination. If a sampling window fails to meet the criteria of a classifier at any stage in the cascade, it is determined not to be a region of interest. Otherwise, it moves through the cascade until the window is discarded or passes all the classifiers, returning a positive match for the object.
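The following is a minimal sketch of cascade-based detection with OpenCV-Python (the language used for the tracking module below). The cascade file and detection parameters are illustrative assumptions; the paper does not specify which trained cascades or parameter values were used.

```python
# Sketch of Haar cascade face detection on a scene camera frame; the cascade
# file and parameters below are illustrative defaults, not the authors' exact
# configuration.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame):
    """Return a list of (x, y, w, h) face bounding boxes in a BGR frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)  # normalize lighting before detection
    return face_cascade.detectMultiScale(
        gray, scaleFactor=1.1, minNeighbors=5, minSize=(60, 60))
```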

3.2 Consensus-based Matching and Tracking of Keypoints for Object Tracking (CMT)

CMT [Nebehay and Pflugfelder 2014] is a tracking algorithm that works by generating keypoints within a region of interest. CMT tracks an object using optical flow vectors and descriptor-based keypoint matching algorithms. Keypoints from the current frame are matched with values from the previous frame. Once matched, these keypoints are polled to determine the new location of the object. Once these votes are cast, clustering is performed on the predicted object locations and outliers are discarded. The new location of the object center is then generated for tracking the object through the video. For our system we use the OpenCV-Python implementation provided by the authors¹.
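The sketch below is not the authors' CMT implementation (available at the repository linked in the footnote); it only illustrates the underlying keypoint-plus-optical-flow idea using OpenCV's pyramidal Lucas-Kanade tracker, with a simple mean of the surviving keypoints standing in for CMT's voting and clustering step.

```python
# Illustrative keypoint tracking: seed keypoints inside the ROI, follow them
# with pyramidal Lucas-Kanade optical flow, and estimate the new object
# center from the points that survive.
import cv2
import numpy as np

def init_keypoints(gray, roi):
    x, y, w, h = roi
    mask = np.zeros_like(gray)
    mask[y:y + h, x:x + w] = 255
    return cv2.goodFeaturesToTrack(gray, maxCorners=200,
                                   qualityLevel=0.01, minDistance=5, mask=mask)

def track_center(prev_gray, gray, points):
    if points is None or len(points) == 0:
        return None, None                       # nothing to track
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, points, None)
    good = new_pts[status.ravel() == 1]
    if len(good) == 0:
        return None, None                       # tracking lost for this frame
    center = good.reshape(-1, 2).mean(axis=0)   # crude consensus: mean location
    return center, good.reshape(-1, 1, 2)
```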

3.3 Perspective Invariant Facial Gaze Analysis

As shown in Figure 2, our system feeds the eye-tracker scene camera video to the object detection module. In this paper we use a cascade of classifiers trained to detect human faces in each video frame. Once a face region is successfully detected, the ROI in the form of a bounding box is provided as input to the object tracking module. The CMT tracking algorithm described above is used to track the ROI in successive frames. In order to maintain tracking accuracy, we compare the tracked ROI to the output of the object detection module every 30th frame (video frame rate: 30 fps). During accuracy validation, if the Euclidean distance between the tracking result and the object detection result is larger than a given threshold (10% change in location or size), we re-initialize the object location using the results from the object detection module. This approach to facial detection and tracking from a moving scene camera enables perspective-invariant facial gaze analysis.
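A sketch of this periodic consistency check is shown below. The helper name and the exact drift measure are our assumptions; the paper only states that re-initialization is triggered by more than a 10% change in location or size, checked every 30th frame.

```python
# Sketch of the periodic tracking-vs-detection consistency check.
import math

REDETECT_EVERY = 30     # frame rate is 30 fps, so roughly once per second
DRIFT_THRESHOLD = 0.10  # ~10% change in location or size triggers a reset

def boxes_agree(tracked, detected):
    """Compare a tracked box to a freshly detected box; both are (x, y, w, h)."""
    tx, ty, tw, th = tracked
    dx, dy, dw, dh = detected
    center_dist = math.hypot((tx + tw / 2) - (dx + dw / 2),
                             (ty + th / 2) - (dy + dh / 2))
    position_drift = center_dist / max(dw, dh)             # relative displacement
    size_drift = abs(tw * th - dw * dh) / float(dw * dh)   # relative area change
    return position_drift <= DRIFT_THRESHOLD and size_drift <= DRIFT_THRESHOLD
```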

Once the face is successfully detected in the video frame, it is further segmented into sub-regions (left eye, right eye, and mouth) as shown in Figure 3. The left and right sides of the face are segmented by splitting the bounding box down the middle. The left and right eye regions are segmented using a separate set of pre-trained Haar cascade classifiers provided with OpenCV. The mouth region is detected in the bottom half of the face using the pre-trained cascade of classifiers provided by Castrillon et al. [2007].
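A sketch of this sub-region segmentation is given below. The eye cascade shipped with OpenCV is used as in the paper, while the mouth cascade file name is an assumption standing in for the cascade of Castrillon et al. [2007]; the detection parameters are illustrative.

```python
# Sketch of face sub-region segmentation: split the face box down the middle
# and run eye/mouth cascades inside the face ROI. File names for the mouth
# cascade and all parameters are assumptions for illustration.
import cv2

eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")
mouth_cascade = cv2.CascadeClassifier("haarcascade_mcs_mouth.xml")  # Castrillon et al.

def segment_face(frame, face_box):
    """Return left/right face halves plus eye and mouth boxes in frame coords."""
    x, y, w, h = face_box
    gray = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    left_half = (x, y, w // 2, h)
    right_half = (x + w // 2, y, w - w // 2, h)
    eyes = eye_cascade.detectMultiScale(gray[: h // 2], minNeighbors=5)
    mouths = mouth_cascade.detectMultiScale(gray[h // 2:], minNeighbors=10)
    # Offset sub-region boxes back into full-frame coordinates.
    eyes = [(x + ex, y + ey, ew, eh) for ex, ey, ew, eh in eyes]
    mouths = [(x + mx, y + h // 2 + my, mw, mh) for mx, my, mw, mh in mouths]
    return left_half, right_half, eyes, mouths
```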

¹CMT: https://github.com/gnebehay/CMT

The SMI Eye Tracking Glasses generate a stream of gaze points which have a one-to-one correspondence with the scene camera video frames. SMI's fixation detection software classifies gaze points as belonging to a fixation or not. Using an offline hierarchical search strategy, we first determine if gaze falls within the boundaries of the face before checking if it is incident on any of the sub-regions. Once this automated data collection is complete we can apply existing gaze analysis and visualization techniques [Kurzhals et al. 2014].
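A sketch of this hierarchical fixation assignment might look as follows; the function and region names are ours, chosen for illustration.

```python
# Sketch of the hierarchical check: a fixation is attributed to a sub-region
# only if it first falls inside the face bounding box.
def inside(point, box):
    px, py = point
    x, y, w, h = box
    return x <= px <= x + w and y <= py <= y + h

def classify_fixation(fix_point, face_box, subregions):
    """subregions: dict mapping names (e.g. 'left eye', 'mouth') to boxes."""
    if face_box is None or not inside(fix_point, face_box):
        return "off-face"
    for name, box in subregions.items():
        if inside(fix_point, box):
            return name
    return "face (other)"
```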

4 Experiment Design

To evaluate our system we collected participants' gaze data during a mock interview experiment. The interview simulates a computing major meeting with an academic advisor to discuss internships, programming languages and tools, academic resources, etc. The appendix provides the complete list of questions and the post-interview questionnaire.

4.1 Participants

Interviewees 28 participants (14 female, 14 male), between the ages of 18 and 31 (avg. 22), volunteered to participate as interviewees in this study. All participants reported normal or corrected-to-normal vision with no color vision abnormalities, and none of the participants wore glasses. Data from 2 male and 2 female participants were removed from gaze analysis due to excessive tracking loss. 8 of the remaining interviewees were non-native English speakers.

Interviewers Interviews were conducted by 2 male and 2 female interviewers (1 teaching assistant, 1 adjunct faculty member, and 2 professional academic advisors).

4.2 Procedure

Each participant was interviewed by a male and a female interviewer, where presentation order was counterbalanced such that 14 participants saw the male interviewer first, and 14 saw the female interviewer first.

The interview was conducted in a small room where the seating distance between the interviewer and interviewee was approximately 5 ft. The interview setup is shown in Figure 1. Instructions were read to interviewees at the start of the experiment, and they were then seated and fitted with the SMI Eye Tracking Glasses. The participants were aware that their gaze was being tracked. A brief one-point calibration procedure was performed using the provided SMI software and a stationary target in the interview room.

The first interviewer entered the room, greeted the interviewee, and proceeded to ask 5 interview questions, listed in the appendix. The interviewee's eye-tracking data was recorded from the time the interviewer entered the room until the interview ended and the interviewer left. At that point, the subject's eye-tracking accuracy was verified and a recalibration was performed if necessary. This process was repeated with the second interviewer and a new set of questions. The scene camera video and the raw gaze data were recorded for each interviewee, and separate data files were created for each interview session. Post-experiment surveys were completed by both the interviewee and the interviewers.

5 Analysis & Discussion

In this section we report on the impact of various factors such as interviewee confidence, familiarity with the interviewer, gender, and native language on interviewee dwell-time for specific regions of the interviewer's face.

Figure 4: Distribution of interviewee performance self-evaluation ratings for both male and female interviewers. 1 = poor performance. 5 = excellent performance.

Figure 5: Distribution of interviewee assessment of the male and female interviewer question difficulty. 1 = easy. 5 = difficult.

Figure 6: Distribution of interviewer responses for interviewee performance. 1 = poor performance. 5 = excellent performance.

5.1 Post Experiment Survey Statistics

A high-level summary of the post-interview questionnaire responses is provided below:

Figure 7: Distribution of interviewer responses for interviewee confidence. 1 = not confident at all. 5 = extremely confident.

• 70% of the interviews were conducted by an interviewer who was not familiar to the interviewee.


Table 1: Average number of fixations and percentage of fixations that fell on the face. Bold entries indicate percentages that were significantly different between male and female interviewees.

                      Male interviewer   Female interviewer   Both
Male interviewee      1321 (72%)         908 (74%)            1098 (73%)
Female interviewee    880 (83%)          887 (79%)            883 (80%)
Both                  1091 (78%)         898 (76%)            990 (77%)

• Interviewees reported no gender preference when ranking their own comfort with the interviewers. There was a perfect 50/50 split in these responses.

• Figure 4 shows the interviewees' performance self-evaluation for both male and female interviewers. 91% ranked their performance as good (4) or excellent (5).

• Figure 5 shows interviewee ratings of question difficulty for both male and female interviewers.

• Figure 6 shows the distribution of the interviewers' assessment of interviewee performance.

• Figure 7 shows the distribution of the interviewers' assessment of interviewee confidence.

5.2 Gaze Analysis

On average, 77% of the fixations that occurred over the course of the interview were located on the face of the interviewer. Table 1 provides a breakdown of fixation count and percentage by interviewee and interviewer gender.

In the following sub-sections we present additional analysis by only considering the subset of fixations which fall within the face region (i.e., relative dwell-time).

5.2.1 Impact of Confidence on Gaze Behavior

To evaluate the impact of confidence on gaze behavior we classify interviewees into two groups, confident and non-confident, by averaging the confidence rating assigned by their interviewers. Interviewees with an average confidence rating greater than 3 were assigned to the confident group, and the remaining were assigned to the non-confident group. This categorization resulted in 12 interviewees being assigned to each group.

Table 2 presents the overall gaze statistics. Confidence results are shown in the first two rows. Confident participants tended to fixate more on the mouth compared to non-confident participants. On the other hand, non-confident participants tended to fixate more on the eyes than confident participants. However, these differences are not statistically significant (p = 0.0584).

5.2.2 Impact of Familiarity on Gaze Behavior

To evaluate the impact of familiarity with the interviewer on the interviewees' gaze behavior, we rely on the questionnaire responses to classify interviewees into two groups. 14 participants stated that they knew at least one interviewer, whereas 10 participants stated that they did not know either of the interviewers.

Table 2: Overall gaze statistics. Columns represent relative dwell-time on specific regions of the interviewer's face. Bold entries indicate dwell-times that were significantly different between categories.

Interviewee category                          Left (%)   Right (%)   Mouth (%)   Eyes (%)
Confident (12 participants)                   52.3       47.7        22.5        4.8
Non-confident (12 participants)               49.5       50.5        11.1        9.4
Knows interviewer (14 participants)           53.1       46.9        21.8        3.4
Does not know interviewer (10 participants)   49.5       50.5        15.6        9.3
Native English speaker (16 participants)      50.9       49.1        13.0        7.9
Non-native English speaker (8 participants)   52.4       47.6        32.6        2.4
All interviewees                              48.9       51.1        19.0        6.2

Rows 3 and 4 of Table 2 show the impact of familiarity on facial dwell-time. Results indicate a larger relative dwell-time in the eye regions for interviewees who were not familiar with the interviewer (9.3%) compared to interviewees who were familiar with the interviewers (3.4%). An ANOVA concluded that this observed difference was significant (see details in Table 4). This observation supports the results of Henderson et al. [2005] that more time is spent studying the eyes when a subject is presented with a new face.

Figure 8 illustrates these differences using heat maps. Note that the visualization techniques developed for desktop-based eye-tracking systems are not directly applicable to head-mounted eye-trackers. In desktop systems, the eye tracker is stationary relative to the display on which the stimuli are presented, making it straightforward to visualize the gaze behavior. However, with head-mounted eye-trackers the viewer can move freely about the 3D environment, hence both the perspective and the stimuli will differ over the course of the video (and also across subjects). To account for these differences we normalize the facial ROI bounding boxes and map all gaze data to normalized ROI coordinates. To provide some context, we display the heat map over an image of the interviewer's face from one of the ROI mappings, using the number of fixations (duration is not considered).
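A sketch of this normalization and heat-map accumulation is shown below; the grid resolution and function names are our assumptions.

```python
# Sketch of the heat-map normalization: fixations are mapped into the unit
# square of the per-frame face ROI, then accumulated into a count grid
# (fixation counts only; durations are not weighted).
import numpy as np

def accumulate_heatmap(fixations_with_rois, bins=64):
    """fixations_with_rois: iterable of ((gx, gy), (x, y, w, h)) pairs."""
    heat = np.zeros((bins, bins), dtype=np.float64)
    for (gx, gy), (x, y, w, h) in fixations_with_rois:
        u, v = (gx - x) / float(w), (gy - y) / float(h)   # normalized ROI coords
        if 0.0 <= u < 1.0 and 0.0 <= v < 1.0:
            heat[int(v * bins), int(u * bins)] += 1
    return heat / heat.max() if heat.max() > 0 else heat
```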

5.2.3 Impact of Native Language on Gaze Behavior

To evaluate the impact of interviewees' native language on their gaze behavior, we rely on the information provided in the pre-experiment consent forms. 8 participants stated that English is not their first language, whereas 16 participants stated that English is their native language.

Rows 5 and 6 of Table 2 show the impact of interviewees' native language on facial dwell-time. Results indicate a larger relative dwell-time in the eye regions for interviewees whose native language is English (7.9%) compared to non-native English speakers (2.4%). Results also show a larger relative dwell-time in the mouth region for interviewees whose native language is not English (32.6%) compared to native English speakers (13.0%). An ANOVA concluded that these observed differences were significant (see details in Table 4). This observation suggests that non-native speakers spend more time during the interview trying to read the lips of the interviewer. In fact, it is evident from Table 2 that the relative dwell-time in the mouth region is highest for non-native English speakers. Figure 9 illustrates these differences using heat maps.

Figure 8: Heat maps generated from an interviewee who knew the interviewer (left) and one who had not met the interviewer prior to the interview (right).

Figure 9: Heat maps generated from a native English speaker (left) and a non-native English speaker (right) during the interview.

Table 3: Impact of gender on relative dwell-time (%) on various regions of the face.

                      Male interviewer             Female interviewer           Both interviewers
                      Left   Right  Mouth  Eyes    Left   Right  Mouth  Eyes    Left   Right  Mouth  Eyes
Male interviewee      57.0   43.0   12.8   16.8    63.3   36.7   9.44   6.42    46.2   53.8   16.0   4.95
Female interviewee    52.8   47.2   11.6   12.4    52.2   47.8   27.7   8.12    51.5   48.5   22.0   7.49
Both                  55.1   44.9   17.9   8.41    48.0   52.0   19.7   4.24    48.9   51.1   19.0   6.22

5.2.4 Impact of Gender on Gaze Behavior

We separately consider the impact of interviewee gender and interviewer gender on gaze behavior. As shown in Table 1, female interviewees spent significantly more time (80%) looking at the interviewer's face compared to male interviewees (73%). Furthermore, female interviewees spent significantly more time gazing at the face of the male interviewers (83%) than the male interviewees did (72%). Within the face, we observed no significant preference for specific sub-regions based on gender (see Table 3 for a summary).

Table 4: Significant ANOVA results

Test performed                                   p val   F val   F-crit   dof
Familiar vs. non-familiar (eye)                  0.018   5.981   4.043    49
Native speaker vs. non-native speaker (mouth)    0.001   12.74   4.043    49
Native speaker vs. non-native speaker (eye)      0.005   4.149   4.043    49

Left-Right Bias Table 2 and Table 3 show that the dwell-time between the left and right sides of the face was roughly the same. The largest bias was present for male interviewees who did not speak English as their first language, with 61.3% of dwell-time spent on the left side of the face and 38.7% on the right. However, an ANOVA confirmed that this effect was not significant (p = 0.5857). This deviation from previous results, which suggest that viewers tend to focus on the left side of the face, may be indicative of the differences between real-world and computer-presented imagery, or perhaps of the increased cognitive load of the interview.

5.3 System Limitations

Object detection and identification presents many challenges. Our system is susceptible to common computer vision limitations such as lighting, orientation, and feature-specific problems that arise when attempting to detect and track regions. Issues with human facial features include head pose (looking away from the camera), facial occlusion, eye detection in the presence of glasses, and blinking.

Success in detecting the mouth and eye regions is crucial for our analysis. Detection of the mouth was largely successful, but detection of the eyes was prone to error. The main offender was the eyeglasses worn by one of the female interviewers. Glasses occlude important features of the eyes, leading to false negatives. Blinking can also be responsible for false negatives; however, the number of frames where this is an issue is negligible. Frames where the interviewer is looking away from the camera also cause false negatives for detection; however, this scenario is less common in interview settings. Overall, the average percentage of frames where the eyes and mouth were not detected was less than 30%. Figure 10 illustrates two examples of frames where object detection fails. These frames are automatically excluded from the subsequent analysis.

6 Conclusion and Future Work

We presented an autonomous system for generating gaze statistics with arbitrary Haar cascade classifier-defined objects in real-world environments. This system drastically reduces the time needed for gaze analysis by reducing the number of frames that need to be manually annotated from the entire video to only those cases where the automated system fails. We evaluated this system using a mock interview setup designed to examine how humans attend to faces in social settings. Dwell-time over the left and right sides of the interviewer's face as well as facial sub-regions was calculated using a hierarchy of Haar cascade classifiers. Results show that gaze was distributed nearly evenly between the left and right sides of the interviewer's face. This observation is contrary to previous studies that report a bias toward the left side of the face. One possible explanation for this discrepancy is that these previous studies were conducted in controlled laboratory settings with faces displayed on a flat computer screen and with unnatural viewing restrictions placed on the viewer. Our results also show that subjects not familiar with the interviewer(s) had significantly higher relative dwell-time in the eye regions. This observation is in line with previous studies showing that more time is spent studying the eyes when a subject is presented with a new face. Finally, our results show that non-native English-speaking subjects had significantly higher relative dwell-time in the mouth region, which suggests that they were trying to read the lips of the interviewer.

Future Work Our framework can be applied to many real-world environments including medical research, sports, psychology, and market research. Future analysis would benefit from analyzing more specific sub-regions, including but not limited to the left and right eyes, ears, nose, and forehead. To aid detection accuracy, de-blurring algorithms could also be applied to frames as they are processed. For cases where a trained cascade of classifiers is not available, image segmentation techniques could be utilized to provide a backup for the detection of objects and the classification of sub-regions based on gaze.

Furthermore, studies of environments such as workplace meetings, paired problem solving, and other collaborative interactions would provide valuable insight into human behavior and the effect of social factors on gaze patterns. We hope to apply our framework to areas with a need for more user-inclusive solutions. One example would be conducting experiments involving lower-functioning children with ASD who can be fitted with eye-tracking glasses, but cannot use remote eye trackers. Finally, for the study presented in this paper, we did not analyze audio information. However, linguistic analysis could be performed using audio collected from the microphone on the eye-tracking glasses. This data could be used to explore the interplay between speech and gaze in social interactions.

Acknowledgments

This material is based on work supported by the National Science Foundation under Award No. IIS-0952631. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. This work is also supported by the New York State Collegiate Science and Technology Entry Program Research Grant at Rochester Institute of Technology.

Appendices

The list of interview questions and the post-interview questionnaire used for the experiment can be found below.

Appendix A: Interviewer Question Set 1

1. What programming language do you feel you are most comfortable with?

2. Have you been on co-op before? If so, where?

3. Have you ever visited the computer science tutoring center?

Figure 10: Examples of cases where detection fails. (Left) A blurry frame and occlusion result in face detection failure. (Right) Glasses and closed eyes result in eye detection failure.

4. Did you have programming experience prior to attending college?

5. On a scale of 1 to 10, 10 being workforce ready, what would you rate your CS skill set at, and why?

Appendix B: Interviewer Question Set 2

1. When writing code for an assignment, do you prefer to use an IDE or a simple text editor?

2. Do you use any source control when coding (e.g., Git or SVN)?

3. Do you primarily use your own computer or a lab computer for assignments?

4. Which time complexity is faster: O(n) or O(log n)?

5. Have you done any web development during your student career?

Appendix C: Interviewer Questionnaire

1. On a scale of 1 to 5, 5 being most confident and 1 being not confident, how would you rate the confidence of the subject?

2. On a scale of 1 to 5, 5 being excellent and 1 being poor, how would you rate the performance of the subject during the interview?

Appendix D: Interviewee Questionnaire

1. Which interviewer did you feel most comfortable with? (Male or Female)

2. On a scale of 1 to 5, 5 being excellent and 1 being poor, how do you feel your interview with the male interviewer went?

3. On a scale of 1 to 5, 5 being difficult and 1 being easy, how difficult were the questions from the male interviewer?

4. Have you met the male interviewer before? (Yes or No)

5. On a scale of 1 to 5, 5 being excellent and 1 being poor, how do you feel your interview with the female interviewer went?

6. On a scale of 1 to 5, 5 being difficult and 1 being easy, how difficult were the questions from the female interviewer?

7. Have you met the female interviewer before? (Yes or No)


References

BAL, E., HARDEN, E., LAMB, D., VAN HECKE, A. V., DENVER, J. W., AND PORGES, S. W. 2010. Emotion recognition in children with autism spectrum disorders: Relations to eye gaze and autonomic state. Journal of Autism and Developmental Disorders 40, 3, 358–370.

BAR-HAIM, Y., SHULMAN, C., LAMY, D., AND REUVENI, A. 2006. Attention to eyes and mouth in high-functioning children with autism. Journal of Autism and Developmental Disorders 36, 1, 131–137.

BURT, D. M., AND PERRETT, D. I. 1997. Perceptual asymmetries in judgements of facial attractiveness, age, gender, speech and expression. Neuropsychologia 35, 5, 685–693.

CAMPBELL, R. 1982. Asymmetries in moving faces. British Journal of Psychology 73, 1, 95–103.

CASTRILLON, M., DENIZ, O., GUERRA, C., AND HERNANDEZ, M. 2007. ENCARA2: Real-time detection of multiple faces at different resolutions in video streams. Journal of Visual Communication and Image Representation 18, 2, 130–140.

CHAJKA, K., HAYHOE, M., SULLIVAN, B., PELZ, J., MENNIE, N., AND DROLL, J. 2006. Predictive eye movements in squash. Journal of Vision 6, 6, 481–481.

CHRISTMAN, S. D., AND HACKWORTH, M. D. 1993. Equivalent perceptual asymmetries for free viewing of positive and negative emotional expressions in chimeric faces. Neuropsychologia 31, 6, 621–624.

DALTON, K. M., NACEWICZ, B. M., JOHNSTONE, T., SCHAEFER, H. S., GERNSBACHER, M. A., GOLDSMITH, H., ALEXANDER, A. L., AND DAVIDSON, R. J. 2005. Gaze fixation and the neural circuitry of face processing in autism. Nature Neuroscience 8, 4, 519–526.

GILBERT, C., AND BAKAN, P. 1973. Visual asymmetry in perception of faces. Neuropsychologia 11, 3, 355–362.

HENDERSON, J. M., AND HOLLINGWORTH, A. 1998. Eye movements during scene viewing: An overview. Eye Guidance in Reading and Scene Perception 11, 269–293.

HENDERSON, J. M., WILLIAMS, C. C., AND FALK, R. J. 2005. Eye movements are functional during face learning. Memory & Cognition 33, 1, 98–106.

KLIN, A., JONES, W., SCHULTZ, R., VOLKMAR, F., AND COHEN, D. 2002. Visual fixation patterns during viewing of naturalistic social situations as predictors of social competence in individuals with autism. Archives of General Psychiatry 59, 9, 809–816.

KURZHALS, K., HEIMERL, F., AND WEISKOPF, D. 2014. ISeeCube: Visual analysis of gaze data for video. In Proceedings of the Symposium on Eye Tracking Research and Applications, ACM, 43–50.

LANSING, C. R., AND MCCONKIE, G. W. 1999. Attention to facial regions in segmental and prosodic visual speech perception tasks. Journal of Speech, Language, and Hearing Research 42, 3, 526–539.

LANSING, C. R., AND MCCONKIE, G. W. 2003. Word identification and eye fixation locations in visual and visual-plus-auditory presentations of spoken sentences. Perception & Psychophysics 65, 4, 536–552.

LUH, K. E., REDL, J., AND LEVY, J. 1994. Left- and right-handers see people differently: Free-vision perceptual asymmetries for chimeric stimuli. Brain and Cognition 25, 2, 141–160.

MACKWORTH, N. H., AND MORANDI, A. J. 1967. The gaze selects informative details within pictures. Perception & Psychophysics 2, 11, 547–552.

MANNAN, S. K., RUDDOCK, K. H., AND WOODING, D. S. 1996. The relationship between the locations of spatial features and those of fixations made during visual examination of briefly presented images. Spatial Vision 10, 3, 165–188.

MICHEL, C., ROSSION, B., HAN, J., CHUNG, C.-S., AND CALDARA, R. 2006. Holistic processing is finely tuned for faces of one's own race. Psychological Science 17, 7, 608–615.

NEBEHAY, G., AND PFLUGFELDER, R. 2014. Consensus-based matching and tracking of keypoints for object tracking. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, IEEE, 862–869.

NICHOLLS, M. E., ELLIS, B. E., CLEMENT, J. G., AND YOSHINO, M. 2004. Detecting hemifacial asymmetries in emotional expression with three-dimensional computerized image analysis. Proceedings of the Royal Society of London B 271, 1540, 663–668.

PELPHREY, K. A., SASSON, N. J., REZNICK, J. S., PAUL, G., GOLDMAN, B. D., AND PIVEN, J. 2002. Visual scanning of faces in autism. Journal of Autism and Developmental Disorders 32, 4, 249–261.

RHODES, G. 1993. Configural coding, expertise, and the right hemisphere advantage for face recognition. Brain and Cognition 22, 1, 19–41.

SCHIFF, B. B., AND TRUCHON, C. 1993. Effect of unilateral contraction of hand muscles on perceiver biases in the perception of chimeric and neutral faces. Neuropsychologia 31, 12, 1351–1365.

SODHI, M., REIMER, B., COHEN, J., VASTENBURG, E., KAARS, R., AND KIRSCHENBAUM, S. 2002. On-road driver eye movement tracking using head-mounted devices. In Proceedings of the 2002 Symposium on Eye Tracking Research & Applications, ACM, 61–68.

TATLER, B. W., WADE, N. J., KWAN, H., FINDLAY, J. M., AND VELICHKOVSKY, B. M. 2010. Yarbus, eye movements, and vision. i-Perception 1, 1, 7–27.

VAN DER GEEST, J., KEMNER, C., VERBATEN, M., AND VAN ENGELAND, H. 2002. Gaze behavior of children with pervasive developmental disorder toward human faces: A fixation time study. Journal of Child Psychology and Psychiatry 43, 5, 669–678.

VATIKIOTIS-BATESON, E., EIGSTI, I.-M., YANO, S., AND MUNHALL, K. G. 1998. Eye movement of perceivers during audiovisual speech perception. Perception & Psychophysics 60, 6, 926–940.

VIDAL, M., TURNER, J., BULLING, A., AND GELLERSEN, H. 2012. Wearable eye tracking for mental health monitoring. Computer Communications 35, 11, 1306–1311.

VIOLA, P., AND JONES, M. 2001. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1, 511–518.
