
Human Visual Behaviour for Collaborative Human-Machine Interaction

Andreas Bulling
Perceptual User Interfaces Group
Max Planck Institute for Informatics
Saarbrücken, Germany
[email protected]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
UbiComp ’15, September 7–11, 2015, Osaka, Japan.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-3574-4/15/09...$15.00.
http://dx.doi.org/xx.xxxx/xxxxxxx.xxxxxxx

Abstract
Non-verbal behavioural cues are fundamental to human communication and interaction. Despite significant advances in recent years, state-of-the-art human-machine systems still fall short in sensing, analysing, and fully “understanding” cues naturally expressed in everyday settings. Two of the most important non-verbal cues, as evidenced by a large body of work in experimental psychology and behavioural sciences, are visual (gaze) behaviour and body language. We envision a new class of collaborative human-machine systems that fully exploit the information content available in non-verbal human behaviour in everyday settings through joint analysis of human gaze and physical behaviour.

ACM Classification Keywords
H.5.m [Information interfaces and presentation (e.g., HCI)]: Miscellaneous

Introduction
Human interactions are complex, adaptive to the situation at hand, and rely to a large extent on non-verbal behavioural cues. However, state-of-the-art human-machine systems still fall short in fully exploiting such cues. Despite significant advances in recent years, current human-machine systems typically use cues and sensing modalities in isolation, only cover a limited and coarse set of human behaviours, and methods to perceive and learn from behavioural cues are mainly developed and evaluated in controlled settings.

We envision a new class of human-machine systems that fully exploit the information content available in natural non-verbal human behaviour in everyday settings through joint analysis of multiple behavioural cues. Multimodal analysis of non-verbal cues has significant potential and will pave the way for a new class of symbiotic human-machine systems that offer human-like perceptual and interactive capabilities. Human gaze and body language are particularly compelling cues given that a large body of work in the behavioural sciences has shown that these cues are rich sources of information and therefore most promising for realising our vision.

This vision poses three key research challenges. The first challenge is human behaviour sensing, i.e. the development of computational methods to unobtrusively, robustly, and accurately estimate gaze and body movement in daily-life settings. The second challenge is computational human behaviour analysis, i.e. the development of machine learning methods to analyse, understand, and learn from visual and physical behavioural cues, and to cope with the significant variability and subtlety of human non-verbal behaviour. The third challenge is how to exploit and apply information on non-verbal cues in symbiotic human-machine vision systems. In the following we will provide an overview of our previous and ongoing efforts to address these challenges.

Human Behaviour Sensing

Figure 1: Sample images from our MPIIGaze dataset showing the considerable variability in terms of place and time of recording, directional light and shadows.

Our efforts to advance the state of the art in human behaviour sensing have so far mainly focused on visual behaviour, i.e. mobile and remote gaze estimation. These efforts have been driven by the vision of pervasive gaze estimation, i.e. unobtrusive and continuous gaze estimation in unconstrained daily-life settings [1]. More specifically, we have developed new mobile eye trackers based on Electrooculography and computer vision [3, 10]. These systems have significantly pushed the boundaries of mobile gaze estimation with respect to wearability and recording duration as well as accessibility, affordability, and extensibility. We have also presented methods for eye tracker self-calibration and for adapting calibrations to new users [8, 16]. In addition, we have developed computer vision methods for remote gaze estimation using monocular RGB cameras to open up new usage scenarios and problem settings for gaze interaction. More specifically, we have developed methods for calibration-free gaze estimation on single and across multiple ambient displays [25, 26, 11] as well as for gaze estimation using the cameras readily integrated into handheld portable devices, such as tablets [23].

More recently, we have started to explore appearance-based methods that directly learn a mapping from eye appearance to on-screen or 3D gaze position. These methods have a number of appealing properties that make them particularly promising for use in daily-life settings, such as increased robustness to varying illumination conditions and camera resolutions. Specifically, we have presented a new large-scale dataset that we collected during everyday laptop use over more than three months and that is significantly more variable than existing ones with respect to appearance and illumination (see Figure 1). We have also presented a method based on a multimodal deep convolutional neural network for in-the-wild appearance-based gaze estimation that significantly outperforms previous methods [24].
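To make the appearance-based approach concrete, the following minimal sketch (in PyTorch) regresses two gaze angles from a grey-scale eye image patch together with the 2D head pose. The GazeNet name, layer sizes, and input resolution are illustrative assumptions, not the exact architecture of [24].

# Minimal sketch of a multimodal appearance-based gaze estimator.
# Layer sizes and the GazeNet name are illustrative, not the exact model of [24].
import torch
import torch.nn as nn

class GazeNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional feature extractor for a grey-scale eye image patch.
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 50, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Fully connected head; the 2D head pose is concatenated with the
        # image features ("multimodal" input).
        self.fc = nn.Sequential(
            nn.Linear(50 * 6 * 12 + 2, 500), nn.ReLU(),
            nn.Linear(500, 2),  # output: yaw and pitch gaze angles
        )

    def forward(self, eye_image, head_pose):
        x = self.features(eye_image).flatten(start_dim=1)
        x = torch.cat([x, head_pose], dim=1)
        return self.fc(x)

# Example: a batch of 36x60 eye patches plus head pose angles.
eye = torch.randn(8, 1, 36, 60)
pose = torch.randn(8, 2)
gaze = GazeNet()(eye, pose)   # -> (8, 2) predicted gaze angles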


Currently, we are exploring learning-by-synthesis approaches in which we use a large number of synthetic, photo-realistic eye images for pre-training the network. We have demonstrated that this approach significantly outperforms state-of-the-art methods for eye-shape registration as well as our own previous results for appearance-based gaze estimation in the wild [22].
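The learning-by-synthesis idea can be sketched as a simple two-stage training procedure: pre-train on synthetic eye images (here replaced by random stand-in tensors), then fine-tune on a smaller real-world set. The sketch reuses the GazeNet class from above; data sizes, epoch counts, and learning rates are placeholder assumptions, not the setup of [22].

# Sketch of learning-by-synthesis: pre-train on (stand-ins for) synthetic eye
# images, then fine-tune on real recordings. Hyper-parameters are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(n):
    # Placeholder data: eye patches, head poses, and gaze angle labels.
    return DataLoader(TensorDataset(torch.randn(n, 1, 36, 60),
                                    torch.randn(n, 2),
                                    torch.randn(n, 2)), batch_size=32)

def train(model, loader, epochs, lr):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()                  # regression on gaze angles
    for _ in range(epochs):
        for eye, pose, gaze in loader:
            opt.zero_grad()
            loss_fn(model(eye, pose), gaze).backward()
            opt.step()

model = GazeNet()                                 # the estimator sketched above
train(model, make_loader(512), epochs=20, lr=0.01)   # synthetic pre-training
train(model, make_loader(128), epochs=5, lr=0.001)   # real-data fine-tuning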

Computational Human Behaviour AnalysisActivity recognition, and in particular recognition of users’activities from their visual behaviour, has been a focus ofour research since several years. In early work we haveshown, for the first time, that everyday activities, such asreading or common office activities, can be predicted inboth stationary and mobile settings with surprisingaccuracy from eye movement data alone [5, 4]. Eyemovements are closely linked to visual informationprocessing, such as perceptual learning and experience,visual search, or fatigue. More recently we have thereforeexplored eye movement analysis as a similarly promisingapproach towards cognition-aware computing: Computingsystems that sense and adapt to covert aspects of userstate that are difficult if not impossible to detect usingother modalities [7]. We have been among the first todemonstrate that selected cognitive states and processescan automatically be predicted from eye movement, suchas visual memory recall [2], concentration [17], orpersonality traits, such as perceptual curiosity [9].
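As a rough illustration of how activities can be recognised from eye movement data, the sketch below computes a few hand-crafted saccade and fixation statistics per time window and feeds them to an SVM. The velocity-threshold saccade detector, the feature set, and the random placeholder data are simplifying assumptions, not the features or pipeline of [5, 4].

# Simplified sketch of activity recognition from eye movement data:
# hand-crafted saccade/fixation statistics per window, classified with an SVM.
import numpy as np
from sklearn.svm import SVC

def window_features(gaze, saccade_threshold=2.0):
    # gaze: (T, 2) array of horizontal/vertical gaze angles in degrees.
    velocity = np.linalg.norm(np.diff(gaze, axis=0), axis=1)
    saccades = velocity > saccade_threshold              # crude saccade detector
    return np.array([
        saccades.mean(),                                 # saccade rate
        velocity[saccades].mean() if saccades.any() else 0.0,      # saccade speed
        velocity[~saccades].mean() if (~saccades).any() else 0.0,  # fixation drift
        gaze[:, 0].std(), gaze[:, 1].std(),              # spread of gaze positions
    ])

# Placeholder windows and labels; real data would come from an eye tracker.
windows = [np.cumsum(np.random.randn(600, 2), axis=0) for _ in range(40)]
labels = np.random.randint(0, 2, size=40)                # e.g. reading vs. browsing
X = np.stack([window_features(w) for w in windows])
clf = SVC(kernel="rbf").fit(X, labels)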

Figure 2: Unsupervised discovery of everyday activities from visual behaviour using a latent Dirichlet allocation (LDA) topic model (gaze data is encoded into a bag-of-words representation, from which LDA infers topic activations).

Despite significant advances in analysing and understanding human visual behaviour, the majority of previous works have focused on short-term behaviour lasting only seconds or minutes. We have contributed the first work on supervised recognition of high-level contextual cues, such as social interactions or being inside or outside, from long-term visual behaviour [6]. Recently, we have extended that work with a new method for unsupervised discovery of everyday activities [15]. We have presented a method that combines a bag-of-words representation of visual behaviour with a latent Dirichlet allocation (LDA) topic model (see Figure 2) as well as a novel long-term gaze dataset that contains full-day recordings of natural visual behaviour of 10 participants (more than 80 hours in total).
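A minimal sketch of this kind of pipeline, assuming visual behaviour has already been discretised into integer eye movement "words": each window becomes a bag-of-words count vector and an LDA topic model is fitted with scikit-learn. The vocabulary size, window construction, and number of topics are illustrative assumptions, not the encoding used in [15].

# Sketch of unsupervised activity discovery from visual behaviour:
# bag-of-words over discretised eye movement "words", followed by LDA.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

VOCAB_SIZE = 64   # number of distinct eye movement "words" (assumed)

def bag_of_words(word_ids):
    # word_ids: sequence of integer word indices for one recording window.
    return np.bincount(word_ids, minlength=VOCAB_SIZE)

# Placeholder windows; real word sequences would come from encoded gaze data.
windows = [np.random.randint(0, VOCAB_SIZE, size=500) for _ in range(50)]
X = np.stack([bag_of_words(w) for w in windows])
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(X)
topics = lda.transform(X)   # per-window topic activations (candidate activities)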

Symbiotic Human-Machine Vision Systems
We have been studying human-computer interaction using gaze for several years, starting from more classical problem settings in gaze-based interaction, such as interaction techniques for object selection, manipulation, and transfer across display boundaries [19, 18]. We have been particularly interested in extending the scope of gaze interaction into everyday settings and in making these interactions more natural. For example, we have introduced smooth pursuit eye movements – the movements we perform when latching onto a moving object – as a natural and calibration-free interaction technique for dynamic interfaces [21, 12]. We have also proposed social gaze as a new paradigm for designing user interfaces that react to gaze and eye contact as a form of non-verbal communication in a similar way as humans do [20]. While gaze has a long history as a modality in human-computer interaction, in these works we have taken a fresh look at it and have demonstrated that there is much more to gaze than the traditional but limited notions of on-screen gaze location and dwelling.
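At its core, pursuit-based interaction boils down to a correlation test between the gaze trajectory and each moving on-screen target, sketched below; the window length and correlation threshold are illustrative assumptions rather than the parameters used in [21, 12].

# Sketch of the pursuit-matching idea: correlate the gaze trajectory with each
# moving target and select the best-matching one above a threshold.
import numpy as np

def pursuit_match(gaze, targets, threshold=0.8):
    # gaze: (T, 2) gaze positions; targets: dict name -> (T, 2) target trajectory.
    best, best_corr = None, threshold
    for name, traj in targets.items():
        # Pearson correlation, computed per axis and averaged.
        corr_x = np.corrcoef(gaze[:, 0], traj[:, 0])[0, 1]
        corr_y = np.corrcoef(gaze[:, 1], traj[:, 1])[0, 1]
        corr = (corr_x + corr_y) / 2
        if corr > best_corr:
            best, best_corr = name, corr
    return best   # None if no target correlates strongly enough

# Example: two circular targets moving in opposite directions.
t = np.linspace(0, 2 * np.pi, 90)                           # ~1.5 s at 60 Hz
targets = {"clockwise": np.stack([np.cos(t), np.sin(t)], axis=1),
           "counter-clockwise": np.stack([np.cos(t), -np.sin(t)], axis=1)}
gaze = targets["clockwise"] + 0.05 * np.random.randn(90, 2)  # noisy gaze samples
print(pursuit_match(gaze, targets))                          # -> "clockwise"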

State-of-the-art computer vision systems still underperform on many visual tasks when compared to humans. We believe that collaborative vision systems that combine the advantages of machine and human perception and reasoning can bridge this performance gap.


We have recently started to explore both directions of collaborative vision, i.e. means of improving the performance of computer vision algorithms by incorporating information from human fixations and vice versa. More specifically, we have proposed an early integration approach of human fixation information into a deformable part model (DPM) for object detection [14]. We have demonstrated that our GazeDPM method outperforms state-of-the-art DPM baselines and that it provides introspection of the learnt models, can reveal salient image structures, and allows us to investigate the interplay between gaze-attracting and gaze-repelling areas. In another work we have focused on predicting the target of visual search from human fixations [13]. In contrast to previous work we have studied a challenging open-world setting in which we no longer assumed that we have fixation data to train for the search targets. Both of these works as well as an increasing number of works by others in computer vision and machine learning point at the significant potential of integrating human and machine vision symbiotically.
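As a much-simplified illustration of search target prediction from fixations (and not the actual models of [13] or [14]), the sketch below pools precomputed features of fixated image patches and ranks candidate targets by cosine similarity; the feature dimensionality and candidate names are hypothetical.

# Much-simplified sketch of predicting a search target from fixations:
# pool features of fixated patches, rank candidate targets by cosine similarity.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def predict_target(fixated_features, candidate_features):
    # fixated_features: (N, D) features of patches around fixations;
    # candidate_features: dict target name -> (D,) feature vector.
    query = np.mean(fixated_features, axis=0)    # pooled fixation evidence
    scores = {name: cosine(query, feat) for name, feat in candidate_features.items()}
    return max(scores, key=scores.get)

# Placeholder feature vectors; real ones would come from an image feature extractor.
fixated_features = np.random.randn(15, 128)
candidates = {"mug": np.random.randn(128), "book": np.random.randn(128)}
print(predict_target(fixated_features, candidates))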

Conclusion
In this work we have motivated the importance and potential of non-verbal behavioural cues, in particular gaze and body language, as a trailblazer for a new class of collaborative human-machine systems that are highly interactive, multimodal, and modelled after natural human-human interactions. We have outlined three key research challenges for realising this vision in unconstrained everyday settings: pervasive visual and physical human behaviour sensing, computational human behaviour analysis, and symbiotic human-machine vision systems. We have provided an overview of our previous and ongoing efforts to address these challenges, and we have highlighted individual works that represent and illustrate the state of the art in the respective area.

REFERENCES
1. Bulling, A., and Gellersen, H. Toward Mobile Eye-Based Human-Computer Interaction. IEEE Pervasive Computing 9, 4 (2010), 8–12.
2. Bulling, A., and Roggen, D. Recognition of Visual Memory Recall Processes Using Eye Movement Analysis. In Proc. UbiComp (2011), 455–464.
3. Bulling, A., Roggen, D., and Tröster, G. Wearable EOG goggles: Seamless sensing and context-awareness in everyday environments. Journal of Ambient Intelligence and Smart Environments 1, 2 (2009), 157–171.
4. Bulling, A., Ward, J. A., and Gellersen, H. Multimodal Recognition of Reading Activity in Transit Using Body-Worn Sensors. ACM Transactions on Applied Perception 9, 1 (2012), 2:1–2:21.
5. Bulling, A., Ward, J. A., Gellersen, H., and Tröster, G. Eye Movement Analysis for Activity Recognition Using Electrooculography. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 4 (Apr. 2011), 741–753.
6. Bulling, A., Weichel, C., and Gellersen, H. EyeContext: Recognition of high-level contextual cues from human visual behaviour. In Proc. CHI (2013), 305–308.
7. Bulling, A., and Zander, T. O. Cognition-aware computing. IEEE Pervasive Computing 13, 3 (July 2014), 80–83.
8. Fehringer, B., Bulling, A., and Krüger, A. Analysing the potential of adapting head-mounted eye tracker calibration to a new user. In Proc. ETRA (2012), 245–248.
9. Hoppe, S., Loetscher, T., Morey, S., and Bulling, A. Recognition of Curiosity Using Eye Movement Analysis. In Adj. Proc. UbiComp (2015).
10. Kassner, M., Patera, W., and Bulling, A. Pupil: An open source platform for pervasive eye tracking and mobile gaze-based interaction. In Adj. Proc. UbiComp (2014), 1151–1160.
11. Lander, C., Gehring, S., Krüger, A., Boring, S., and Bulling, A. GazeProjector: Accurate Gaze Estimation and Seamless Gaze Interaction Across Multiple Displays. In Proc. UIST (2015).
12. Pfeuffer, K., Vidal, M., Turner, J., Bulling, A., and Gellersen, H. Pursuit calibration: Making gaze calibration less tedious and more flexible. In Proc. UIST (2013), 261–270.
13. Sattar, H., Müller, S., Fritz, M., and Bulling, A. Prediction of search targets from fixations in open-world settings. In Proc. CVPR (2015), 981–990.
14. Shcherbatyi, I., Bulling, A., and Fritz, M. GazeDPM: Early Integration of Gaze Information in Deformable Part Models. arXiv:1505.05753, 2015.
15. Steil, J., and Bulling, A. Discovery of everyday human activities from long-term visual behaviour using topic models. In Proc. UbiComp (2015).
16. Sugano, Y., and Bulling, A. Self-calibrating head-mounted eye trackers using egocentric visual saliency. In Proc. UIST (2015).
17. Tessendorf, B., Bulling, A., Roggen, D., Stiefmeier, T., Feilner, M., Derleth, P., and Tröster, G. Recognition of hearing needs from body and eye movements to improve hearing instruments. In Proc. Pervasive (2011), 314–331.
18. Turner, J., Alexander, J., Bulling, A., and Gellersen, H. Gaze+RST: Integrating gaze and multitouch for remote rotate-scale-translate tasks. In Proc. CHI (2015), 4179–4188.
19. Turner, J., Bulling, A., Alexander, J., and Gellersen, H. Cross-device gaze-supported point-to-point content transfer. In Proc. ETRA (2014), 19–26.
20. Vidal, M., Bismuth, R., Bulling, A., and Gellersen, H. The Royal Corgi: Exploring Social Gaze Interaction for Immersive Gameplay. In Proc. CHI (2015), 115–124.
21. Vidal, M., Bulling, A., and Gellersen, H. Pursuits: Spontaneous interaction with displays based on smooth pursuit eye movement and moving targets. In Proc. UbiComp (2013), 439–448.
22. Wood, E., Baltrušaitis, T., Zhang, X., Sugano, Y., Robinson, P., and Bulling, A. Rendering of eyes for eye-shape registration and gaze estimation. arXiv:1505.05916, 2015.
23. Wood, E., and Bulling, A. EyeTab: Model-based gaze estimation on unmodified tablet computers. In Proc. ETRA (2014), 207–210.
24. Zhang, X., Sugano, Y., Fritz, M., and Bulling, A. Appearance-based gaze estimation in the wild. In Proc. CVPR (2015), 4511–4520.
25. Zhang, Y., Bulling, A., and Gellersen, H. SideWays: A gaze interface for spontaneous interaction with situated displays. In Proc. CHI (2013), 851–860.
26. Zhang, Y., Chong, M. K., Müller, J., Bulling, A., and Gellersen, H. Eye tracking for public displays in the wild. Personal and Ubiquitous Computing (2015), 1–15.