
Speak2Label: Using Domain Knowledge for Creating a Large Scale Driver Gaze Zone Estimation Dataset

Shreya Ghosh, Student Member, IEEE, Abhinav Dhall, Member, IEEE, Garima Sharma, Student Member, IEEE, Sarthak Gupta, Student Member, IEEE, Nicu Sebe, Senior Member, IEEE

Abstract—Labelling of human behavior analysis data is a complex and time consuming task. In this paper, a fully automatic technique for labelling an image based gaze behavior dataset for driver gaze zone estimation is proposed. Domain knowledge can be added to the data recording paradigm and later labels can be generated in an automatic manner using speech to text conversion. In order to remove the noise in STT due to different ethnicities, the speech frequency and energy are analysed. The resultant Driver Gaze in the Wild (DGW) dataset contains 586 recordings, captured during different times of the day including evening. The large scale dataset contains 338 subjects with an age range of 18-63 years. As the data is recorded in different lighting conditions, an illumination robust layer is proposed in the Convolutional Neural Network (CNN). The extensive experiments show the variance in the database resembling real-world conditions and the effectiveness of the proposed CNN pipeline. The proposed network is also fine-tuned for the eye gaze prediction task, which shows the discriminativeness of the representation learnt by our network on the proposed DGW dataset.

Index Terms—Eye Gaze Estimation, Self Driving Car, Driver Monitoring.

I. INTRODUCTION

The progress in deep learning based human behavior analysis is mainly driven by the availability of large labelled datasets. It has been observed that the process of labelling becomes non-trivial for complicated tasks. In this paper, we argue that by adding domain knowledge about the task during the data recording paradigm, one can automatically label the dataset fairly quickly. The behavior task chosen in this paper is driver gaze zone estimation.

Distracted driving is one of the main causes of traffic accidents [10]. It is important to understand the far-reaching negative impacts of this killer, which is particularly common among younger drivers [10], [20]. According to a World Health Organization report1, there were 1.25 million road traffic deaths globally in 2013 and this number is increasing day by day. In order to prevent this, efforts are being made on Advanced Driver Assistance Systems (ADAS), which ensure smooth and safe driving by alerting the driver or taking control of the

S. Ghosh, A. Dhall and G. Sharma are with the Human-Centered Artificial Intelligence Group at Monash University (E-mails: [email protected], [email protected] and [email protected]). A. Dhall and S. Gupta are with the Indian Institute of Technology, Ropar (E-mail: [email protected]). N. Sebe is with the Department of Information Engineering and Computer Science, University of Trento, Italy (E-mail: [email protected]).

1https://www.who.int/gho/road safety/mortality/en/

Fig. 1: Images from the proposed Driver Gaze in the Wild (DGW) dataset. Please note the different recording environments and age range.

car (handover) when a driver is distracted or fatigued. For this process, ADAS needs to estimate the driver's gaze behaviour, in particular, where the driver is looking.

Over the past few years, monitoring a driver's behaviour as well as visual attention has become an interesting topic of research [27], [41]. Analysis of both the driver's head-pose and eye-gaze is an integral part of a driver monitoring system. A driver's sparse gaze zone estimation is an important cue for understanding the driver's mental state. In vision-based driver behaviour monitoring systems, coarse gaze direction prediction instead of exact gaze location is usually acceptable [15], [40], [37], [33], [36], [39]. The coarse gaze regions are defined as the in-vehicle areas where drivers usually look while driving, e.g. the windshield, rear-view mirror, side mirrors and speedometer. As per recent studies [39], [15], head pose information is also relevant in predicting the gaze direction. This hypothesis fits well with real and natural driving behaviour. In many cases, a driver may move both head and eyes while looking at a target zone.

Accurate driver’s gaze detection requires a very specific set



of sensors [27], [41], [2], which capture detailed information about the eyes and pupil movements. These sensors can be broadly classified into two types: intrusive and non-intrusive. The intrusive sensors [27] require contact with human skin or eyes. Thus, these devices can cause an unpleasant user experience. On the other hand, non-intrusive sensors do not require physical contact [41], [2]. Mainly, image processing based gaze estimation methods are leveraged in this domain. These methods face a number of challenges like occlusion, blurriness, illumination, head-pose variation, specular reflection, etc. Additionally, manual data labelling is a tedious task, which requires time as well as domain knowledge. In this work, fully automatic labelling is done fairly quickly by introducing speech during the dataset recording. Our Speech to Text (STT) based labelling technique reduces the above-mentioned limitations of the earlier works.

II. MAIN CONTRIBUTIONS

The main contributions of this paper are as follows:

1) Traditionally, computer vision based datasets are either labelled manually or using sensors, which makes the labelling process complicated. To the best of our knowledge, this is the first method for speech based automatic labelling of a human behavior analysis dataset. The labelling process in our paper is automatic and does not require any invasive sensors or manual labelling. This makes the task of collecting and labelling data with a fairly large number of participants faster. The proposed method requires less time (∼30 sec) to generate the labels as compared to manual labelling (∼10 min). Additionally, voice frequency-based detection is used for extracting data samples missed by the automatic speech to text method.

2) An 'in the wild' dataset, Driver Gaze in the Wild (DGW), containing 586 videos of 338 different subjects is collected. To the best of our knowledge, this is the largest publicly available driver gaze estimation dataset.

3) As the dataset has been recorded in different illumination conditions, a convolutional layer robust to illumination is proposed.

4) We have performed eye gaze representation learning to judge the generalization performance of our network.

The rest of the paper is structured as follows: Section III describes the prior work in this area. Sections IV and V describe the data collection and the annotation process. The proposed methods are described in Section VI. Experimental details and results are discussed in Sections VII and VIII. Conclusion, limitations and future research are discussed in Section IX.

III. PRIOR WORK

With the progress in autonomous and smart cars, the requirement for automatic driver monitoring has been observed and researchers have been working on this problem for a few years now. For driver's gaze zone estimation, eye tracking is the most evident solution. Eye tracking can be done via sensors or by using computer vision based techniques.

Study | # of Subjects | # of Zones | Illumination | Labelling
Choi et al. [5] | 4 | 8 | Bright & Dim | 3D Gyro.
Lee et al. [22] | 12 | 18 | Day | Manual
Fridman et al. [8] | 50 | 6 | Day | Manual
Tawari et al. [32] | 6 | 8 | Day | Manual
Vora et al. [37] | 10 | 7 | Diff. day times | Manual
Jha et al. [15] | 16 | 18 | Day | Head-band
Wang et al. [39] | 3 | 9 | Day | Motion Sensor
DGW (Ours) | 338 | 9 | Diff. day times including evening | Auto

TABLE I: Comparison of in-car gaze estimation datasets.

A. Gaze Tracking for Driver’s Gaze Zone Estimation

1) Sensor based Gaze Tracking: Sensor based tracking mainly utilizes dedicated sensor-integrated hardware devices for monitoring the driver's gaze in real-time. These devices require accurate pre-calibration and are additionally expensive. A few examples of these sensors are Infrared (IR) cameras [18], contact lenses [27] and head-mounted devices [15], [13], [14]. Robinson et al. [27] used two magnetic fields in quadrature phase and two coils on the lens to measure simultaneously the horizontal, vertical and torsional eye movements. All of the above-mentioned systems have sensitivity towards outdoor lighting, difficulty in hardware calibration and system integration. Additionally, constant vibrations and jolts during driving can affect the system's performance. Bergasa et al. [4] proposed a hardware based system to monitor the driver's level of vigilance in real-time. The hardware consists of an active infrared illuminator, which calculates the percent eye closure, eye closure duration, blink frequency, nodding frequency, face position, and fixed gaze parameters. Along with the hardware, a software based visual behaviour monitoring system is also used. Finally, a fuzzy classifier is used to infer the inattentiveness level of the driver. Similarly, [4], [17], [16] used IR cameras to detect pupils. Thus, it is worthwhile to investigate image processing based zone estimation techniques.

2) Vision based Gaze Tracking: Prior studies on vision based gaze tracking mainly focus on two types of zone estimation methods: head-pose only, and combined head-pose and eye-gaze.

Fig. 2: A few samples from our DGW dataset, where the head pose and eye gaze differ for subjects. The labelling process should not be oblivious to such scenarios. The labels should also not be based on the head pose only, as in some prior works [15], [39].


In an interesting work, Lee et al. [22] introduced a vision-based real-time gaze zone estimator based on a driver's head movement, mainly composed of yaw and pitch. Further, Tawari et al. [33] presented a distributed camera based framework for gaze zone estimation using head pose only. Additionally, [33] collected a dataset from naturalistic on-road driving in streets, though containing six subjects only. For the gaze zone ground truth determination, human experts manually labelled the data. A driver's head pose provides only partial information regarding his/her gaze direction, as there may be an interplay between eye ball movement and head pose [9]. Hence, methods relying entirely on head pose information may fail to disambiguate eye movements with a fixed head-pose. Later, Tawari et al. [32] combined head pose with horizontal and vertical eye gaze for robust estimation of the driver's gaze zone. Experimental protocols were evaluated on the dataset collected by [33], showing improved performance over head movement alone [33]. In another interesting work, Fridman et al. [8], [9] proposed a generalized gaze zone estimation using a random forest classifier. They validated the methods on a dataset containing 40 drivers and with cross-driver testing (testing on unseen drivers). When the ratio of the classifier prediction having the highest probability to the second highest probability is greater than a particular threshold, the decision tree branch is pruned. Similarly, [35] combined 3D head pose with both 3D and 2D gaze information to predict the gaze zone via a support vector machine classifier. Choi et al. [5] proposed the use of deep learning based techniques to predict a categorized driver's gaze zone.

Recently, Wang et al. [39] proposed an Iterative Closest Point (ICP) based head pose tracking method for appearance-based gaze estimation. To avoid the tricky initialization of the point-cloud transformation, they introduced a particle filter with auxiliary sampling for head pose tracking, which accelerated the iterative convergence of ICP. The labelling is performed initially using a head motion sensor and later clustering is applied on this head pose information. The labels in this case do not consider the scenario where there is a difference between the eye gaze and the head pose of a subject. In another interesting work, Jha et al. [15] map the 6D head pose (three head position and three head rotation angles) to an output map of quantized gaze zones. The users in their study wear headbands, which are used to label the data using the head pose information only.

A few of the selected methods are also described in Table I. Please note that most of the databases are not publicly available, with the exception of Lee et al. [22], though it contains 12 subjects only. It is easily observable that our proposed DGW database has a larger number of subjects and more diverse illumination settings. Further, the methods discussed above require either manual labelling of the driver database or are based on a wearable sensor. We argue that labelling of gaze can be a noisy and error-prone task for labellers due to the task being monotonous. Further, wearable sensors such as a headband may be uncomfortable for some subjects. Therefore, in this work, we propose an alternative method of using speech as part of the data recording. This removes the need for manual labelling and for the user having to wear any headgear.

B. Self Supervised Learning

Nowadays, self-supervised learning is getting attention as it has the potential to overcome the limitation of supervised learning based algorithms, which require large amounts of labelled data. A few recent works [23], [24], [38] explored this domain. The labelling technique used in this work also exploits domain knowledge (speech) and helps in labelling a fairly large database quickly.

C. Other Relevant Gaze Estimation Methods

For predicting eye contact during a conversation, Muller et al. [25] analysed people's gaze and speaking behaviour in a lab environment using weak 'eye contact' labels based on Gaussian kernel density estimation. Further, the constant shifts in gaze estimation are corrected using speech information. Zhang et al. [43] detected eye contact via a clustering method over pre-trained CNN features. As compared to both these interesting works, the problem discussed in our paper is quite different, as we are interested in predicting the zone where the driver is looking. This considers the eye gaze, the head pose (Figure 2) and the interplay of gaze and head pose.

IV. DGW DATASET

We curate a driver gaze zone estimation database as datasets in this domain are small in size and mostly not available for academic purposes. The database contains 586 videos of 338 subjects (247 males and 91 females) with a duration of 15-20 sec each. Figure 1 shows frames from the proposed dataset, where subjects are looking at 9 gaze zone categories. The dataset will be made publicly available.

A. Data Recording Paradigm

The data has been collected in a Hyundai Verna car with different subjects at the driver's position. We pasted number stickers on different gaze zones of the car (please refer to Figure 3). The nine car zones are chosen from the rear-view mirror, side mirrors, radio, speedometer and windshield. The recording sensor is a Microsoft Lifecam RGB camera, which contains a microphone as well. For recording, the following protocol is followed:

1) We asked the subjects to look at the zones marked with numbers in different orders. For each zone, the subject has to fixate on a particular zone number, speak the zone's number and then move to the next zone. For recording realistic behaviour, no constraint was given to the subjects about looking via eye movements and/or head movements. The subjects chose the way in which they were comfortable. This leads to more naturalistic data (see Figure 2).

2) The subjects who wear spectacles were requested, if comfortable, to record twice, i.e. with and without the spectacles.

3) A research assistant (RA) was also present in the car, observed the subject and checked the recorded video. If there was a mismatch between the zone and the gaze, the subject repeated the recording. This ensures a correct driver gaze to car zone mapping.


Fig. 3: Overview of the automatic data annotation technique. On the top are representative frames from each zone. Please note the numbers written in words below the curve. The red coloured 'to' shows an incorrect detection by the STT library. This is corrected by the frequency analysis approach in Section V-B. On the bottom right are the reference car zones.

Fig. 4: Age distribution in the DGW dataset. (Best viewed in colour)


The data is collected during different times of the day to record different illumination settings (as evident in Figure 1). Recording sessions were also conducted during the evening after sunset at different locations in the university. This enabled different sources of illumination from street lights (light emitting diodes, compact fluorescent lamps and sodium vapour lamps) and also from inside the car. There were a few sessions during which the weather was cloudy. This brings a healthy amount of variation in the data w.r.t. illumination settings.

B. Dataset Statistics

The following are a few relevant dataset statistics:

1) Age: The curated DGW dataset has variation in the subjects' ages. The age ranges from 18-63 years. Fig. 4 depicts the group-wise age distribution.

2) Gender Distribution: The data consists of 247 maleand 91 female subjects.

3) Daytime Distribution: 55.2% of the data is collected in daylight and 44.8% during the evening with different light sources. The daytime recording sessions were conducted during both sunny and cloudy weather. The recording was performed in different places in the university, resulting in variation due to different illumination sources.

4) Duration: The average length of a subject's recording session is 15-20 sec. The data was recorded over a one and a half month time span.

5) Face Size: The average face size in the dataset is 179 × 179 pixels. To measure the face size, the Dlib face detection library is used.

6) Other Attributes: Specular reflection adds more challenge to any gaze estimation data. In the DGW data, 30.17% of the subjects have prescription spectacles. The subjects having prescription spectacles were requested to record data with and without spectacles (when they could). Thus, two different settings corresponding to these subjects were recorded. Additionally, there are variations in head pose due to the sitting posture of the participants.

V. AUTOMATIC DATA ANNOTATION

As manual data labelling can be an erroneous and monotonous task, we rely on automatic data labelling. For that purpose, we exploit the numbers spoken by the participants during the data recording.


Fig. 5: DBSCAN clustering of head pose along the yaw, pitch and roll axes for zones 2, 3, 5 and 7 (left to right), respectively. A large number of clusters within a zone implies that we cannot rely only on head pose information for labelling the data. (best viewed in colour)

A. Speech to Text

Post extraction of the audio from the recorded samples, the IBM Watson STT API2 is used to convert the audio signal into text. We searched for the keywords 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight' and 'nine' in the extracted speech in ascending order. As the data is recorded with 'one' to 'nine' spoken in sequence, sequentially ordered text with high probability is considered. Further, we extracted the frames corresponding to the detected time stamps by adding an offset (10 frames, chosen empirically) before and after the detection of the zone number. We used the US English broadband model (16 kHz and 8 kHz). In a few cases, this model was unable to detect the numbers correctly; this may be due to different pronunciations of English words across different cultures. In order to overcome this limitation, we applied STT rectification (Section V-B).
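A minimal sketch of this keyword-to-frame mapping is given below. It assumes the STT result has already been parsed into a list of (word, start time, end time) tuples; the 30 fps frame rate, the tuple format and the function name are illustrative assumptions, not part of the released pipeline.

```python
# Minimal sketch: map spoken digit timestamps (from an STT result) to frame ranges.
# `stt_words` is assumed to be a list of (word, start_sec, end_sec) tuples already
# obtained from the STT service; the exact API/response format is not shown here.

DIGITS = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
OFFSET_FRAMES = 10          # empirically chosen offset, as stated in the text
FPS = 30                    # assumed camera frame rate

def zone_frame_ranges(stt_words, fps=FPS, offset=OFFSET_FRAMES):
    """Return {zone_id: (start_frame, end_frame)} for digits found in ascending order."""
    ranges = {}
    next_idx = 0                                   # digits are expected in sequence
    for word, start_sec, end_sec in stt_words:
        if next_idx < len(DIGITS) and word.lower() == DIGITS[next_idx]:
            start_f = max(0, int(start_sec * fps) - offset)
            end_f = int(end_sec * fps) + offset
            ranges[next_idx + 1] = (start_f, end_f)   # zones are numbered 1..9
            next_idx += 1
    return ranges

# Example with a toy transcription:
words = [("one", 1.2, 1.6), ("two", 3.0, 3.4), ("three", 5.1, 5.5)]
print(zone_frame_ranges(words))   # {1: (26, 58), 2: (80, 112), 3: (143, 175)}
```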

B. STT Rectification

Generally, the human voice frequency lies in the range of 300-3000 Hz [34]. We use frequency and energy domain analysis of the audio signal to detect the relevant time durations in the data. The assumption, based on the recording paradigm of the DGW dataset, is that the numbers are spoken in a sequence. If, during the scan of the numbers generated from the STT process, there is a mismatch for a particular number, the following steps are executed to find the particular zone's data (a code sketch of this check follows the list):

1) Convert the stereo input signal to a mono audio signal and start scanning with a fixed window size T.

2) Calculate the frequency over the time domain of the audio for a window of size T.

3) If the frequency lies in the human voice range of 300-3000 Hz, then this window is a probable candidate.

4) Compute the ratio between the energy of the speech band and the total energy for this window. If the ratio is above a threshold, then this window is a probable candidate.

5) If there is an overlap between the timestamps generated from steps 3 and 4 above, the zone label is assigned to the frames between the timestamps.
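The following NumPy sketch illustrates steps 1-4; the window size and energy-ratio threshold are illustrative placeholders, as the paper does not report the exact values used.

```python
import numpy as np

def candidate_voice_windows(signal, sr, win_sec=0.5, band=(300.0, 3000.0), ratio_thresh=0.5):
    """Scan a mono signal with fixed windows; return (start_s, end_s) for windows whose
    dominant frequency lies in the voice band and whose band-to-total energy ratio
    exceeds the threshold. For stereo input, average the channels first (step 1)."""
    win = int(win_sec * sr)
    candidates = []
    for start in range(0, len(signal) - win, win):
        chunk = signal[start:start + win]
        spec = np.abs(np.fft.rfft(chunk)) ** 2                  # power spectrum (step 2)
        freqs = np.fft.rfftfreq(win, d=1.0 / sr)
        dominant = freqs[np.argmax(spec)]                       # frequency check (step 3)
        in_band = (freqs >= band[0]) & (freqs <= band[1])
        ratio = spec[in_band].sum() / (spec.sum() + 1e-12)      # energy ratio check (step 4)
        if band[0] <= dominant <= band[1] and ratio > ratio_thresh:
            candidates.append((start / sr, (start + win) / sr))
    return candidates
```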

This process extracted an extra 4000 frames, which were earlier missed due to noise in the STT output.

2https://speech-to-text-demo.ng.bluemix.net/

We have also manually checked some recordings at random and confirmed that most of the useful data has been extracted using the complete process above.

C. Analysis of Automatic Data Annotation

1) Head Pose Variation: We analyse the variation in head pose values w.r.t. their zones by computing density based clustering [7] on the head pose information3 (i.e. yaw, pitch and roll). We observed that for zones 1-3, the head-pose is mainly centrally concentrated (forming 2, 3 and 2 clusters, respectively). For the remaining zones, there are more than 5 clusters each. This experiment indicates that it may be noisy to label the data by considering only head-pose (as in Wang et al. [39]). Figure 5 shows the clustering results of zones 2, 3, 5 and 7.
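For illustration, this clustering analysis can be reproduced along the following lines with scikit-learn; the DBSCAN hyperparameters are assumptions, since the paper does not state them.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_head_pose(pose_angles, eps=3.0, min_samples=20):
    """Cluster per-frame (yaw, pitch, roll) angles, in degrees, for one gaze zone.
    eps/min_samples are illustrative. Returns the number of clusters found
    (the DBSCAN noise label -1 is excluded from the count)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(np.asarray(pose_angles))
    return len(set(labels)) - (1 if -1 in labels else 0)

# A zone whose frames form many clusters cannot be labelled from head pose alone, e.g.:
# n_clusters = cluster_head_pose(zone5_poses)
```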

2) Comparison With Manual Annotation Process: For comparing the automatic annotation with manual annotation, expert and non-expert annotators were assigned. There were 3 annotators (2 experts and 1 non-expert4) assigned to this task. We asked the annotators to label 15 videos. Further, we calculated a few statistics to judge the quality of labelling. The 2 experts take approximately 10-15 min (on average) to annotate each video. The mean squared errors of the automatic labels with respect to the 2 expert annotators' labels for the 15 videos are 0.49 and 0.54, respectively. Similarly, the mean squared error in the case of the non-expert annotator is 0.78. The Cohen's kappa between the expert annotators is 0.8. At the microsecond level, there is a high probability of incorrect labelling in human annotation.

VI. METHOD

In this section, we discuss the proposed networks for estimating a driver's gaze zone.

A. Preliminaries

1) Baseline: For the baseline methods, we experiment with several standard networks like Alexnet [21], Resnet [11] and the Inception network [31]. The input to the network is the cropped face computed using the Dlib face detection library.

3 https://github.com/yinguobing/head-pose-estimation
4 Here, 'expert' refers to annotators familiar with the labelling process in machine learning, and 'non-expert' refers to people outside this field.


The proposed network is shown in Figure 6. The baseline network takes a 224 × 224 × 3 facial image as input.

From the results using the standard networks mentioned above, one of the limitations observed is that, as the DGW dataset has been recorded in diverse illumination conditions, some samples which contain an illumination change across the face are mis-classified. Sample images can be seen in Figure 9a. To the backbone network, we add the illumination layer presented below.

2) Illumination Invariant Layer: For illumination invariant facial image generation, we follow a common assumption proposed by Lambert and Phong [26]. The authors adopted the concept of an ideal matte surface, which obeys Lambert's cosine law. The law states that the incoming incident light at any point of an object surface is diffused uniformly in all possible directions. Later, Phong added a specular highlight modelization term to the Lambertian model. This term is independent of the object's geometric shape. Moreover, it is also independent of the lighting direction of each surface point. As per [42], both models can be formulated by the following equations:

L_{diffuse} = S_d E_d (n · l)    (1)

L_{diffuse} + L_{specular} = S_d E_d (n · l) + S_s E_s (v · r)^γ    (2)

In Equation (1), S_d is the diffuse reflection coefficient; E_d denotes the diffuse lighting intensity; n corresponds to the normal vector and l denotes the vector along the direction of the incoming light [42]. Similarly, in Equation (2), S_s is the specular reflection coefficient; E_s denotes the specular lighting intensity; v is the vector along the observation direction and r is the vector along the reflected light. γ is a constant termed the 'shininess constant'. For illumination invariant learning, we follow the computationally efficient chromaticity property (Zhang et al. [42]). The chromaticity c = {r, g, b} is formed from the following skin colour formation equation:

c_i = \frac{f_i \lambda_i^{-5} S(\lambda_i)}{\left( \prod_{j=1}^{3} f_j \lambda_j^{-5} S(\lambda_j) \right)^{\frac{1}{3}}} \times \frac{e^{-\frac{k_2}{\lambda_i T}}}{e^{\frac{1}{3} \sum_{j=1}^{3} -\frac{k_2}{\lambda_j T}}}    (3)

Here, i = {1, 2, 3} corresponds to the R, G and B channels, respectively; f_i comes from the Dirac delta function; the λ_i are the tri-chromatic wavelengths5; S(λ) is the spectral reflectance function of the skin surface; k_2 = hc/k_B is the second radiation constant in Wien's approximation6; and T represents the lighting colour temperature. If we write Equation (3) in the form c_i = A × B, then the left part (A) is illumination invariant and the right part (B) is illumination dependent due to the colour temperature factor T, which varies throughout the dataset.

5 The wavelengths of the R, G and B lights, where λ_1 ∈ [620, 750], λ_2 ∈ [495, 570], λ_3 ∈ [450, 495] (unit: nm).
6 h: Planck's constant, h = 6.626 × 10^{-34} J·s; k_B: Boltzmann's constant, k_B = 1.381 × 10^{-23} J·K^{-1}; and c = 3 × 10^8 m·s^{-1}.

Decoder Network Details
Layer | Filters | Kernel
Conv 2D | 1024 | 3x3
Upsampling | – | 2x2
Conv 2D | 128 | 3x3
Upsampling | – | 4x4
Conv 2D | 128 | 3x3
Upsampling | – | 4x4
Conv 2D | 64 | 3x3
Conv 2D | 3 | 3x3

TABLE II: The details of the decoder network.

Thus, for illumination invariant feature extraction, we initialize a constant kernel holding the T-independent value of part (B). T is initialized with a Gaussian distribution. Further, the product of the constant and the Gaussian kernel is used for learning.
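One possible reading of this construction, sketched below as a custom Keras layer, multiplies a fixed constant kernel by a Gaussian-initialised trainable kernel before convolving; the filter count, constant value and initialiser scale are assumptions, not the authors' released implementation.

```python
import tensorflow as tf

class IlluminationLayer(tf.keras.layers.Layer):
    """Sketch of an illumination-robust convolution: the kernel is the product of a
    fixed constant kernel (the T-independent value) and a Gaussian-initialised kernel
    that absorbs the colour-temperature variation. Shapes and values are illustrative."""
    def __init__(self, filters=128, kernel_size=3, const_value=1.0, **kwargs):
        super().__init__(**kwargs)
        self.filters, self.kernel_size, self.const_value = filters, kernel_size, const_value

    def build(self, input_shape):
        shape = (self.kernel_size, self.kernel_size, input_shape[-1], self.filters)
        # fixed, non-trainable part of the kernel
        self.const_kernel = tf.constant(self.const_value, shape=shape)
        # Gaussian-initialised, trainable part (models the temperature factor T)
        self.t_kernel = self.add_weight(
            "t_kernel", shape=shape,
            initializer=tf.keras.initializers.RandomNormal(stddev=0.05), trainable=True)

    def call(self, x):
        # convolve with the product of the constant and the Gaussian kernel
        return tf.nn.conv2d(x, self.const_kernel * self.t_kernel,
                            strides=1, padding="SAME")
```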

3) Attention based Gaze Prediction: The eye region of a person's face is important in estimating the driver's gaze zone as it gives vital information about the eye gaze. We have already shown that there are images in the DGW dataset (Figure 2) where the head pose is frontal even though the driver is looking at a zone that is not in front, using only a change in eye gaze. Motivated by this hypothesis, we add the attention augmented convolution module [3] to the network. Let us consider a convolution layer having F_in input filters, F_out output filters and kernel size k. H and W represent the height and width of an activation map. d_v and d_k denote the depth of values and the depth of queries/keys in Multi-Head Attention (MHA). v = d_v / F_out is the ratio of attention channels to the number of output filters and k_a = d_k / F_out is the ratio of key depth to the number of output filters. The Attention Augmented Convolution (AAConv) [3] can be written as follows:

AAConv(X) = Concat[Conv(X), MHA(X)]    (4)

where X is the input. MHA consists of a 1×1 convolution with F_in input filters and (2 d_k + d_v) = F_out (2 k_a + v) output filters to compute queries/keys and values. An additional 1×1 convolution with d_v = F_out · v input and output filters is also added to mix the contribution of the different key heads. AAConv is robust to translation and to different input resolutions.
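A rough Keras sketch of Equation (4) is given below. It substitutes the built-in MultiHeadAttention layer (applied over flattened spatial positions, without the relative positional encodings of [3]) for the full AAConv formulation; the ratios v, k_a and the head count are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def aa_conv(x, f_out, k=3, v=0.2, ka=0.1, num_heads=4):
    """Sketch of Attention Augmented Convolution (Eq. 4): concatenate a standard
    convolution with multi-head self-attention over the spatial positions.
    d_v = v * f_out attention channels, d_k = ka * f_out key channels (illustrative)."""
    d_v, d_k = int(v * f_out), int(ka * f_out)
    conv_out = layers.Conv2D(f_out - d_v, k, padding="same")(x)

    h, w = x.shape[1], x.shape[2]                               # spatial dims must be known
    seq = layers.Reshape((h * w, x.shape[-1]))(x)               # flatten the spatial grid
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=max(d_k // num_heads, 1),
                                     output_shape=d_v)(seq, seq)
    attn = layers.Reshape((h, w, d_v))(attn)
    return layers.Concatenate(axis=-1)([conv_out, attn])        # Eq. (4)
```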

B. Proposed Network Architecture

The proposed network architecture is shown in the left box of Figure 6. In this part of the network, we perform the gaze-zone classification task with Inception V1 as the backbone network. The input of this network is a facial image. The illumination robust layer and the attention layer are introduced at the beginning and end of the backbone network to enhance performance. After the attention layer, the resultant embedding is passed through two dense layers (1024, 512) before predicting the gaze zone.
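A sketch of this pipeline is shown below, reusing the IlluminationLayer and aa_conv sketches from above. Since Keras does not ship Inception V1, InceptionV3 stands in for the backbone, so this is an approximation of the described architecture rather than the exact model.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_gaze_zone_model(num_zones=9):
    """Sketch of the proposed pipeline: illumination layer -> backbone -> attention ->
    two dense layers (1024, 512) -> 9-way softmax."""
    inp = layers.Input((224, 224, 3))
    x = IlluminationLayer(filters=3)(inp)                       # illumination-robust front end
    backbone = tf.keras.applications.InceptionV3(include_top=False, weights=None,
                                                 input_shape=(224, 224, 3))
    x = backbone(x)
    x = aa_conv(x, f_out=256)                                   # attention-augmented block
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dense(512, activation="relu")(x)
    out = layers.Dense(num_zones, activation="softmax")(x)
    return Model(inp, out)
```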


Fig. 6: LEFT: Overview of the proposed network (Section VI-B). RIGHT: Label refinement architecture (Section VI-C).

C. Label Refining

Our proposed automatic data annotation may generate noisy labels during the gaze transition between two zones in the car. For example, a subject utters the word 'one' and looks at region 'one'. After that, the subject shifts gaze from region 'one' to region 'two' and utters 'two'. During the transition between these two utterances, some frames may have been incorrectly annotated. Similarly, in a few cases, the zone utterance and the shifting of gaze may not have occurred simultaneously. To handle such situations, we perform label rectification based on an encoder-decoder network followed by latent feature clustering.

Encoder-Decoder. The encoder part of the network is based on the backbone network (Inception V1) as above. The decoder network consists of a series of alternating convolution and up-sampling layers. The details of the decoder network are given in Table II. The convolution layers have 1024, 128, 128, 64 and 3 kernels of dimension 3×3. It is to be noted that the facial embedding representation learnt in this network encodes the eye gaze along with the head pose information.
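A direct Keras transcription of the decoder in Table II might look as follows; the activations and padding are assumptions, as Table II only lists filters and kernel sizes.

```python
from tensorflow.keras import layers, Sequential

def build_decoder():
    """Decoder from Table II: alternating Conv2D and UpSampling2D layers that map the
    encoder embedding back to a 3-channel image. Activations/padding are assumed."""
    return Sequential([
        layers.Conv2D(1024, 3, padding="same", activation="relu"),
        layers.UpSampling2D(2),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.UpSampling2D(4),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.UpSampling2D(4),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(3, 3, padding="same", activation="sigmoid"),
    ])
```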

Clustering. After learning the auto-encoder, we perform k-means clustering on the facial embeddings. Here the value of k = 9 is the same as the number of zones. After k-means clustering of all the samples, the previous labels of the transition frames are updated on the basis of their Euclidean distance from the cluster centres. More specifically, we measure the distance between a transition frame's facial embedding and the 9 cluster centres and assign the frame the label of the nearest cluster. The refined labels are then considered as the ground truth labels for the database.
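This refinement step can be sketched as follows with scikit-learn; the mapping from cluster ids to zone labels via majority vote is an assumption, as the paper does not detail it.

```python
import numpy as np
from sklearn.cluster import KMeans

def refine_labels(embeddings, labels, transition_idx, n_zones=9):
    """Fit k-means (k = number of zones) on the facial embeddings, then relabel the
    transition frames with the zone of their nearest cluster centre.
    `labels` is an integer NumPy array; `transition_idx` indexes the transition frames."""
    km = KMeans(n_clusters=n_zones, n_init=10, random_state=0).fit(embeddings)
    # majority-vote mapping from cluster id to zone label, using non-transition frames
    cluster_to_zone = {}
    keep = np.setdiff1d(np.arange(len(labels)), transition_idx)
    for c in range(n_zones):
        members = keep[km.labels_[keep] == c]
        if len(members):
            cluster_to_zone[c] = np.bincount(labels[members]).argmax()
    refined = labels.copy()
    for i in transition_idx:
        nearest = np.linalg.norm(km.cluster_centers_ - embeddings[i], axis=1).argmin()
        refined[i] = cluster_to_zone.get(nearest, labels[i])
    return refined
```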

Fig. 7: Overview of the optical flow based pruning method (Section VII-A2). The rejected frame is marked with a red cross.


VII. EXPERIMENTS

We use the Keras deep learning library with the TensorFlow backend for the implementation of the networks. Our gaze zone prediction pipeline consists of face detection and network learning.

Data partition: The dataset is divided into train, validation and test sets. The partition is performed randomly. 203 subjects are used in the training partition, 83 subjects in the validation partition and the remaining 52 subjects in the test partition. Having unique identities in the data partitions helps in learning more generic representations.

A. Data Pre-processing

After labelling the curated data, a few pre-processing steps are performed to remove noise from the training data.

1) Face Detection: For face detection, the dlib face detection library [28] is used with a low threshold value, as there are large illumination variations in the dataset. As a result of the low threshold value, the false face detection rate also increased.

2) Optical Flow-based Face Pruning: In order to eliminate the false face detection issue, we compute dense optical flow [1] across the detected face frames. If two consecutive frames have a high Frobenius norm of the optical flow magnitude (i.e. above an empirically decided threshold), we discard the latter one. An example of face pruning is shown in Figure 7, in which the third frame is discarded due to its high Frobenius norm value. The comparison pairs are also marked in the figure. This removes the incorrectly detected faces in the training set.
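A sketch of this pruning step is given below using OpenCV's Farneback dense flow (a stand-in for the method of [1]); the threshold value is an illustrative placeholder.

```python
import cv2
import numpy as np

def prune_false_faces(face_frames, thresh=5000.0):
    """Compute dense flow between consecutive face crops and drop the later frame when
    the Frobenius norm of the flow magnitude exceeds a threshold. Farneback flow and
    the threshold value are illustrative choices."""
    kept = [face_frames[0]]
    for cur in face_frames[1:]:
        prev_g = cv2.cvtColor(kept[-1], cv2.COLOR_BGR2GRAY)
        cur_g = cv2.cvtColor(cv2.resize(cur, prev_g.shape[::-1]), cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_g, cur_g, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2)          # per-pixel flow magnitude
        if np.linalg.norm(mag) < thresh:            # Frobenius norm of the magnitude map
            kept.append(cur)                        # otherwise the later frame is discarded
    return kept
```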

B. Experimental Setup

Experiments are conducted with the following settings:

1) Baseline: The Inception-V1 network is used as the backbone network;

2) Baseline + Illumination Layer: On top of Inception-V1, an illumination robust layer is introduced;

3) Baseline + Attention: An attention augmented convolution layer is introduced on top of the baseline;


4) Baseline + Illumination Layer + Attention: This network is implemented to take advantage of both the illumination robust layer and attention;

5) Performance with standard backbone networks: The performance of several state-of-the-art networks is explored;

6) Ablation study on the illumination layer: The ablation study is conducted to get an overview of the illumination robust layer. Our investigation includes where to deploy the layer and layer details, including the filter size;

7) Eye-gaze representation learning: For this purpose, the driver gaze network is fine-tuned for the eye-gaze estimation task, using the CAVE and TabletGaze datasets;

8) Evaluation on the Nvidia Jetson Nano platform: With future in-car deployment in mind, a test evaluation on the Nvidia Jetson Nano platform is conducted.

C. Evaluation Metric

Overall accuracy (in %) is used as the evaluation metric for gaze zone prediction. For gaze representation learning, the angular error (in °) is used as the evaluation metric for the CAVE dataset [30] and the mean error (in cm) is used for TabletGaze [12]. For the CAVE dataset, the angular error is calculated as mean error ± std. deviation (in °).

D. Training Details

For the backbone network, the Inception-V1 architecture is used. For training, the following parameters are used: 1) SGD optimizer with a 0.01 learning rate and a 1 × 10^{-6} decay per epoch; 2) kernels are initialized with a Gaussian distribution with an initial bias value of 0.2; 3) in each case, the models are trained for 200 epochs with batch size 32. For the illumination layer, the following parameters are used: λ_1 ∈ [620, 750], λ_2 ∈ [495, 570], λ_3 ∈ [450, 495] (nm), h = 6.626 × 10^{-34} J·s, k_B = 1.381 × 10^{-23} J·K^{-1} and c = 3 × 10^8 m·s^{-1}. For the value of λ, the average of each range is used.
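For concreteness, the training configuration above could be wired up in Keras roughly as follows; the loss function and the per-epoch decay schedule are assumptions about details not fully specified in the text, and `model`, `x_train`, `y_train`, `x_val` and `y_val` are assumed to exist.

```python
import tensorflow as tf

def lr_schedule(epoch, lr):
    # apply the stated 1e-6 decay per epoch
    return lr * 1.0 / (1.0 + 1e-6 * epoch)

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="sparse_categorical_crossentropy",   # assumed loss for 9-way classification
              metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=200, batch_size=32,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```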

For gaze representation learning, we fine-tune our proposed network with the following changes. Two FC layers (256, 256 node dense layers for TabletGaze and 1024, 512 node dense layers for the CAVE dataset) are added with ReLU activation to fine-tune the network. The learning rate was set to 0.001 with the SGD optimizer. For both eye gaze datasets, we froze the first 50 layers of the network and fine-tuned the rest.

9 Gaze Zone Classification Accuracy (%)
Networks | Network | Network + Illumination Layer
Alexnet | 56.25 | 57.98
Resnet-18 | 59.14 | 60.87
Resnet-50 | 58.52 | 60.05
Inception V1 | 60.10 | 61.46

TABLE III: Comparison of the performance of the backbone networks on the proposed DGW dataset with original labels.

VIII. RESULTS

A. Network Performance

1) Experiment with state-of-the-art backbone networks: We experimented with several network architectures to get an overview of the trade-off between the number of parameters and accuracy. Specifically, we chose lightweight networks like AlexNet, ResNet-18, ResNet-50 and Inception. Among these networks, the Inception network performs better due to its robust handling of different scales. The further results are based on Inception V1 as the backbone network. Based on this empirical analysis, the baseline network is the Inception V1 network plus global average pooling and Fully Connected (FC) layers. The quantitative analysis of these networks is shown in Table III. It also reflects that the addition of the illumination invariant layer increases the performance. It is effective in real-world scenarios in which different sources of illumination play a vital role and many existing techniques may not perform properly. The classification performance increases when the illumination robust layer is added to the baseline network, as compared to the baseline network only.

2) 9-zones Vs 7-zones: Additionally, a study is conducted on the baseline network regarding the optimal number of gaze zones. Although the data is collected in the 9-zone setting, there is scope to merge a few zones. Zone 1 and zone 2 could be merged and treated as a single large zone. Similarly, zone 5 and zone 6 can be merged. Thus, the 9 zones are reduced to 7 zones. From Table IV, it is observed that this 7-zone partition performs better than the 9-zone one. The validation accuracies of 7 and 9 zones are 66.56% and 60.10%, respectively. Similarly, there is approximately a 7% increase in test accuracy. These quantitative results indicate that it is easier to differentiate 7 zones as the zones are more sparse. Thus, to add more challenge, the 9-zone setting is considered for further analysis. Moreover, it is better to have denser zones in a real-world driver gaze analysis task.

Inception V1 (Trained with original labels) | Validation Accuracy (%) | Test Accuracy (%)
Baseline (7-Zones) | 66.56 | 67.39
(9-Zones):
Baseline (EmotiW 2020) | 60.10 | 60.98
Baseline + Attention | 60.75 | 60.08
Baseline + Illumination | 61.46 | 60.42
Baseline + Attention + Illumination | 64.46 | 62.90

TABLE IV: Comparison based on classification accuracy of the proposed network architecture with and without the illumination invariant layer and attention.

Illumination Layer | Layer Details | Validation Accuracy (%) | Test Accuracy (%)
Dense Layer | 1024 | 56.16 | 57.93
Dense Layer | 4096 | 58.51 | 61.18
Convolution Layer | 32 | 53.48 | 52.16
Convolution Layer | 64 | 60.47 | 58.38
Convolution Layer | 128 | 61.46 | 60.42
Convolution Layer | 256 | 57.71 | 57.35

TABLE V: Variation in network performance (in %) w.r.t. the illumination layer type and size.


Dataset | [19] (x, y) | [29] (x, y) | [6] (x, y) | Ours (x, y)
CAVE | 1.67±1.19, 3.47±3.99 | 2.65±3.96, 4.02±5.82 | 1.67±1.19, 1.74±1.57 | 2.17±0.91, 1.31±0.73

TABLE VI: Results on the CAVE dataset (at 0° yaw angle) using angular deviation, calculated as mean error ± std. deviation (in °).

3) Improvements over baseline:
Quantitative Analysis. Table IV shows the gradual improvement over the baseline due to the addition of the attention layer and the illumination layer. By adding the attention layer, we introduce guided learning. In the next step, we added the illumination robust layer to encode illumination invariant features, which further increases the performance of the model. Our final model has both the illumination and attention layers followed by FC layers.

Significance Test. A one-way ANOVA test is performed on the models to calculate their statistical significance. The p-values of the 'baseline+illumination', 'baseline+attention' and 'baseline+illumination+attention' models are 0.03, 0.04 and 0.01, respectively. The p-values of the models are < 0.05, which indicates that the results are statistically significant.
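For reference, such a test can be run with SciPy as sketched below; the per-run accuracy lists are placeholders, not the paper's measurements.

```python
from scipy.stats import f_oneway

# Compare per-run accuracies of the model variants (placeholder values).
baseline_runs = [60.1, 59.8, 60.4]
illum_runs = [61.5, 61.2, 61.8]
attn_illum_runs = [64.5, 64.1, 64.8]

f_stat, p_value = f_oneway(baseline_runs, illum_runs, attn_illum_runs)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")   # p < 0.05 -> statistically significant
```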

Qualitative Analysis. Figure 9a shows a few examples where previously mis-classified images are classified correctly after the addition of the illumination layer. Mostly, spectacle glare and partial information about the eyes can be recovered by the illumination layer.

4) Label Modification Performance: To avoid errors in the automatic labelling process, clustering based label modification is performed.

Qualitative Analysis. After clustering, the labels of approximately 400 frames changed. The impact is greatest for zone 9: frames in this zone increased by 5.2% of all frames. Zone 8 has an overall increase of 3.5% and zone 5 a 2.3% increment. The other zones have less than a 1% increment. This suggests that whenever there is a significant change in distance among consecutive zones, the error increases.

Quantitative Analysis. After label modification, the validation accuracy changes from 64.46% to 66.44%, as shown in Table VIII.

5) Evaluation on the Test set: For all of the methods, test set performances are calculated. Both 'baseline+illumination' and 'baseline+attention' perform slightly lower than the baseline (Table IV). The combined effect of illumination and attention improves the test set performance from 60.98% to 62.90%. Errors in the automatic labelling could be the cause of this behaviour. Further, training is performed on the 'train+validation' set and performance is evaluated on the test set. The test performance increases to 64.31%, over the baseline (60.98%) and the 'baseline+illumination+attention' network (62.90%).

B. Ablation Study

1) Effect of Illumination layer: We have also conducted experiments to observe the variation in network performance w.r.t. the illumination layer. The comparison is shown in Table V.

Position Vs Performance: First of all, the position of the illumination layer needs to be determined. For this analysis, the layer is implemented at the beginning and at the end conv-layers. For the beginning conv-layer, the performance increases. On the other hand, for the end conv-layer (before the flatten layer), the performance does not increase. The reason could be that after several convolution and maxpooling operations, it is difficult to extract illumination robust low-level features.

Filter size Vs Performance: This analysis is conducted in two settings: 1) the layer is implemented as a convolutional layer and 2) the layer is implemented as a dense (fully connected) layer. From the table, we can observe that as the illumination layer filter size increases, the performance also increases. We used the baseline + illumination + attention framework to compute these results.

2) Effect of Lip Movement: To check the effect of lip movement on the network, we perform the following experiment: we crop the eye region up to the nose tip point and re-train the gaze zone prediction. This results in a drop of overall accuracy for the baseline network. Figure 8 shows the overall pipeline of this experiment. It is interesting to note that the performance difference is minimal, given the loss of head pose information when only the eyes are used as input.

3) Eye Gaze Representation Learning: In order to analyze whether our network has learnt a generalized face representation, we extract features from the weights trained on DGW and fine-tune them for the task of eye gaze estimation. We fine-tuned our network on the Columbia Gaze (CAVE) [30] and TabletGaze [12] datasets. The results are shown in Table VI and Table VII, respectively.

TabletGaze
Methods | Raw pixels [12] | LoG [12] | LBP [12] | HoG [12] | mHoG [12]
k-NN | 9.26 | 6.45 | 6.29 | 3.73 | 3.69
RF | 7.2 | 4.76 | 4.99 | 3.29 | 3.17
GPR | 7.38 | 6.04 | 5.83 | 4.07 | 4.11
SVR | – | – | – | – | 4.07
Ours | 3.77

TABLE VII: Results on TabletGaze (mean error in cm) with comparison to the baselines of [12]. The effectiveness of the features learnt from the DGW database is demonstrated here (Section VIII-B3).

Methods | Validation Accuracy (%) | Test Accuracy (%) | Validation F1 | Test F1
Proposed Network | 64.46 | 62.90 | 0.52 | 0.52
Train on (Train + Val) | – | 64.31 | – | 0.59
Label Modified | 66.44 | 61.98 | 0.63 | 0.59

TABLE VIII: Results for the proposed methods. The 'Inception V1 + Illumination + Attention' model is used for the experiments.


Fig. 8: Lip movement effect analysis pipeline (Section VIII-B2).

In the case of CAVE, our model stabilizes the standard deviation. Similarly, for TabletGaze, the fine-tuned network works well. These quantitative results indicate that our proposed network has learnt an efficient representation from the DGW dataset.

4) Real Time Deployment: It is important to note that a more complex backbone network may achieve better performance. We chose Inception V1 due to its relatively better performance (Table III) and to be able to evaluate performance on a small platform like the Nvidia Jetson Nano7. These networks are expected to run close to real-time on an in-car computer. On the Nvidia Jetson Nano, the best network (from Table IV) runs at 10 Frames Per Second (FPS) with 34,532,961 parameters.

C. Qualitative Analysis of Failure Cases

Figure 9b shows a few examples where our method fails to detect the correct gaze zone. The reason may be incomplete information about the eyes in these samples. In all the cases (except the bottom left), the eyes are not properly visible and even the illumination layer is unable to recover the information. In these cases, head pose and the neighbouring frames' gaze information could be vital clues to predict the gaze zone, as there could be an interplay between pose and gaze.

IX. CONCLUSION, LIMITATIONS AND FUTURE WORK

In this paper, we show that automatic labelling can be performed by adding domain knowledge during the data recording process. We propose a large scale gaze zone estimation dataset, which is labelled fully automatically using the STT conversion technique. It is observed that the information missed by STT can be recovered by analyzing the frequency and energy of the audio signal. The dataset recordings are performed in different illumination conditions, which makes the dataset closer to realistic scenarios and also non-trivial from a prediction perspective. To take care of the varying illumination across the face, we propose an illumination invariant layer in our network. The results show that the illumination invariant layer is able to correctly classify some samples which have different or low illumination. Further, experiments on eye gaze prediction using the features learnt by our network on the DGW dataset show that the learnt features are useful for face analysis tasks.

In order to record even more realistic data, car driving also needs to be added to the data recording paradigm. The trickier part is how to use speech effectively in this case, as the drivers will be concentrating on the driving activity. Perhaps a smaller subset of a driving dataset can be labelled

7https://www.nvidia.com


Fig. 9: (a) Correctly classified samples with the illumination invariant layer (Section VI-A2). Without the layer, these samples were mis-classified by the baseline network. (b) A few samples incorrectly classified by our network.

using a network trained on the existing stationary recorded dataset, and it can be validated by human labellers. Further, the DGW dataset will be extended with more female subjects to balance the current gender distribution. At this point, our method does not consider temporal information. It will be interesting to understand the effect of temporal information on gaze estimation as a continuous regression-based problem. A few prior works used IR cameras [18], [39] for their superior performance in dealing with illumination effects such as on the driver's glasses. Our use of a webcam-based RGB camera validated the process of STT labelling and illumination invariance. It will be of interest to try distillation based knowledge transfer from our DGW dataset into a smaller sized network later fine-tuned on smaller gaze estimation datasets.

Currently, on the Nvidia Jetson Nano, we achieve 10 FPS. This should further improve if network optimization techniques such as quantization and separable kernels are applied. Our proposed method implicitly learns the discriminativeness due to the head pose. In the future, we plan to integrate head pose information explicitly to evaluate its usefulness. We will also evaluate the performance of the network and the usefulness of the learnt features for the task of distracted driver detection. One direction can be joint gaze zone and distraction detection as a multi-task learning problem. The performance of the features extracted from our network for the task of eye gaze prediction motivates us to explore more face related problems such as face verification and head pose estimation in the future.

REFERENCES

[1] L. Alvarez, J. Weickert, and J. Sanchez. Reliable estimation of dense optical flow fields with large displacements. International Journal of Computer Vision, 39(1):41–56, 2000.
[2] S. Baluja and D. Pomerleau. Non-intrusive gaze tracking using artificial neural networks. In Advances in Neural Information Processing Systems, pages 753–760, 1994.
[3] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le. Attention augmented convolutional networks, 2019.
[4] L. M. Bergasa, J. Nuevo, M. A. Sotelo, R. Barea, and M. E. Lopez. Real-time system for monitoring driver vigilance. IEEE Transactions on Intelligent Transportation Systems, 7(1):63–77, 2006.
[5] I.-H. Choi, S. K. Hong, and Y.-G. Kim. Real-time categorization of driver's gaze zone using the deep learning techniques. In 2016 International Conference on Big Data and Smart Computing (BigComp), pages 143–148. IEEE, 2016.
[6] N. Dubey, S. Ghosh, and A. Dhall. Unsupervised learning of eye gaze representation from the web, 2019.
[7] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
[8] L. Fridman, P. Langhans, J. Lee, and B. Reimer. Driver gaze estimation without using eye movement. arXiv:1507.04760, 2(4), 2015.
[9] L. Fridman, J. Lee, B. Reimer, and T. Victor. 'Owl' and 'lizard': patterns of head pose and eye pose in driver gaze classification. IET Computer Vision, 10(4):308–314, 2016.
[10] E. Gliklich, R. Guo, and R. W. Bergmark. Texting while driving: A study of 1211 US adults with the distracted driving survey. Preventive Medicine Reports, 4:486–489, 2016.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[12] Q. Huang, A. Veeraraghavan, and A. Sabharwal. TabletGaze: unconstrained appearance-based gaze estimation in mobile tablets. arXiv, 2015.
[13] S. Jha and C. Busso. Challenges in head pose estimation of drivers in naturalistic recordings using existing tools. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pages 1–6. IEEE, 2017.
[14] S. Jha and C. Busso. Fi-cap: Robust framework to benchmark head pose estimation in challenging environments. In 2018 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2018.
[15] S. Jha and C. Busso. Probabilistic estimation of the gaze region of the driver using dense classification. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 697–702. IEEE, 2018.
[16] Q. Ji and X. Yang. Real time visual cues extraction for monitoring driver vigilance. In International Conference on Computer Vision Systems, pages 107–124. Springer, 2001.
[17] Q. Ji and X. Yang. Real-time eye, gaze, and face pose tracking for monitoring driver vigilance. Real-Time Imaging, 8(5):357–377, 2002.
[18] M. W. Johns, A. Tucker, R. Chapman, K. Crowley, and N. Michael. Monitoring eye and eyelid movements by infrared reflectance oculography to measure drowsiness in drivers. Somnologie-Schlafforschung und Schlafmedizin, 11(4):234–242, 2007.
[19] S. Jyoti and A. Dhall. Automatic eye gaze estimation using geometric & texture-based networks. In IEEE International Conference on Pattern Recognition, 2018.
[20] L. Knapp. "https://theharrispoll.com/pop-quiz-what-percentage-of-drivers-have-brushed-or-flossed-their-teeth-behind-the-wheel-while-its-crazy-to-think-that-anyone-would-floss-their-teeth-while-cruising-down-the-highway-it/". Erie Insurance, 2015.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[22] S. J. Lee, J. Jo, H. G. Jung, K. R. Park, and J. Kim. Real-time gaze estimator based on driver's head orientation for forward collision warning system. IEEE Transactions on Intelligent Transportation Systems, 12(1):254–267, 2011.
[23] M. Leo, D. Cazzato, T. De Marco, and C. Distante. Unsupervised eye pupil localization through differential geometry and local self-similarity. Public Library of Science, 2014.
[24] X. Liu, J. Van De Weijer, and A. D. Bagdanov. Exploiting unlabeled data in CNNs by self-supervised learning to rank. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[25] P. Muller, M. X. Huang, X. Zhang, and A. Bulling. Robust eye contact detection in natural multi-person interactions using gaze and speaking behaviour. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications, page 31. ACM, 2018.
[26] B. T. Phong. Illumination for computer generated pictures. Communications of the ACM, 18(6):311–317, 1975.
[27] D. Robinson. A method of measuring eye movement using a scleral search coil in a magnetic field. IEEE Transactions on Bio-Medical Electronics, 1963.
[28] S. Sharma, K. Shanmugasundaram, and S. K. Ramasamy. FAREC—CNN based efficient face recognition technique using dlib. In 2016 International Conference on Advanced Communication Control and Computing Technologies (ICACCCT), pages 192–195. IEEE, 2016.
[29] E. Skodras, V. G. Kanas, and N. Fakotakis. On visual gaze tracking based on a single low cost camera. Signal Processing: Image Communication, 2015.
[30] B. A. Smith, Q. Yin, S. K. Feiner, and S. K. Nayar. Gaze locking: passive eye contact detection for human-object interaction. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology, pages 271–280. ACM, 2013.
[31] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[32] A. Tawari, K. H. Chen, and M. M. Trivedi. Where is the driver looking: Analysis of head, eye and iris for robust gaze zone estimation. In 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 988–994. IEEE, 2014.
[33] A. Tawari and M. M. Trivedi. Robust and continuous estimation of driver gaze zone by dynamic analysis of multiple face videos. In 2014 IEEE Intelligent Vehicles Symposium Proceedings, pages 344–349. IEEE, 2014.
[34] I. R. Titze and D. W. Martin. Principles of voice production, 1998.
[35] B. Vasli, S. Martin, and M. M. Trivedi. On driver gaze estimation: Explorations and fusion of geometric and data driven approaches. In 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pages 655–660. IEEE, 2016.
[36] S. Vora, A. Rangesh, and M. M. Trivedi. On generalizing driver gaze zone estimation using convolutional neural networks. In 2017 IEEE Intelligent Vehicles Symposium (IV), pages 849–854. IEEE, 2017.
[37] S. Vora, A. Rangesh, and M. M. Trivedi. Driver gaze zone estimation using convolutional neural networks: A general framework and ablative analysis. IEEE Transactions on Intelligent Vehicles, 3(3):254–265, 2018.
[38] K. Wang, L. Lin, C. Jiang, C. Qian, and P. Wei. 3D human pose machines with self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[39] Y. Wang, G. Yuan, Z. Mi, J. Peng, X. Ding, Z. Liang, and X. Fu. Continuous driver's gaze zone estimation using RGB-D camera. Sensors, 19(6):1287, 2019.
[40] Y. Wang, T. Zhao, X. Ding, J. Bian, and X. Fu. Head pose-free eye gaze prediction for driver attention study. In 2017 IEEE International Conference on Big Data and Smart Computing (BigComp), pages 42–46. IEEE, 2017.
[41] D. Xia and Z. Ruan. IR image based eye gaze estimation. In IEEE ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 2007.
[42] W. Zhang, X. Zhao, J.-M. Morvan, and L. Chen. Improving shadow suppression for illumination robust face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(3):611–624, 2019.
[43] X. Zhang, Y. Sugano, and A. Bulling. Everyday eye contact detection using unsupervised gaze target discovery. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, pages 193–203. ACM, 2017.

Shreya Ghosh is currently pursuing her PhD at Monash University, Australia. She received the MS(R) degree in Computer Science and Engineering from the Indian Institute of Technology Ropar, India. She received the bachelor's degree in Computer Science and Engineering in 2016 from the Govt. College of Engineering and Textile Technology (Serampore, West Bengal, India). Her research interests include affective computing, computer vision and deep learning. She is a student member of the IEEE.


Abhinav Dhall is currently a lecturer in the Human-Centered Artificial Intelligence Group at Monash University, Australia. He received the PhD degree in computer science from the Australian National University in 2014. After this, he did postdocs at the University of Waterloo and the University of Canberra. He was also an adjunct research fellow at the Australian National University. He was awarded the Best Doctoral Paper Award at the ACM International Conference on Multimodal Interaction 2013, a Best Student Paper Honourable Mention at the IEEE International Conference on Automatic Face and Gesture Recognition 2013 and a Best Paper Nomination at the IEEE International Conference on Multimedia and Expo 2012. His research interests are in computer vision for affective computing and assistive technology. He is a member of the IEEE.

Garima Sharma is currently pursuing her PhD at Monash University, Australia. She received her M.Tech in 2015 from Himachal Pradesh University, India. Prior to this, she received her B.Tech Honours in 2013 from the University Institute of Information Technology, Himachal Pradesh University, India. Her research interests include affective computing, computer vision and deep learning. She is a student member of the IEEE.

Sarthak Gupta received his B.Tech degree in Computer Science and Engineering from the Indian Institute of Technology Ropar, India. His research interests include computer vision and deep learning. He is a student member of the IEEE.

Nicu Sebe is currently a Professor with the University of Trento, Italy, leading the research in the areas of multimedia information retrieval and human behaviour understanding. He was the General Co-Chair of the IEEE FG Conference 2008 and ACM Multimedia 2013, and the Program Chair of the International Conference on Image and Video Retrieval in 2007 and 2010 and of ACM Multimedia 2007 and 2011. He was the Program Chair of ICCV 2017 and ECCV 2016 and the General Chair of ACM ICMR 2017 and ICPR 2017. He is a fellow of the International Association for Pattern Recognition.