
Page 1:

Egocentric vision from wearable cameras for studies of neurodegenerative diseases: Visual attention maps

Jenny Benois-Pineau,

LaBRI – Université de Bordeaux – CNRS UMR 5800/ University Bordeaux1

H. Boujut, V. Buso, L. Letoupin, Ivan Gonsalez-Diaz (University Bordeaux 1)

Y. Gaestel, J.-F. Dartigues (INSERM)

Page 2:

Summary
1. Introduction and motivation
2. Wearable video
3. Visual attention/saliency maps from wearable video: application to recognition of manipulated objects
4. Perspectives

Page 3:

Introduction and motivation

• Recognition of Instrumental Activities of Daily Living (IADL) of patients suffering from Alzheimer Disease

• Decline in IADL is correlated with future dementia

• IADL analysis:

• Survey for the patient and relatives → subjective answers

• Observations of IADL with the help of video cameras worn by the patient at home

• Objective observations of the evolution of disease

• Adjustment of the therapy for each patient: IMMED ANR, Dem@care IP FP7 EC

Page 4:

Context
Projects:
- ANR Blanc IMMED 2009–2012: LaBRI, IMS, ISPED, IRIT
- EU FP7 IP Dem@care 2011–2015: ITI-CERTH (Gr), INRIA Sophia, UBx1 (LaBRI, IMS), CHUN/ISPED, LTU (Sw), DCU/DCU memory clinic (Ir), Cassidian (Fr), Philips (NL), VISTEK ISRA Vision (T)
- General trend: computer vision and multimedia indexing for healthcare applications (USA: Georgia Tech, Harvard, USC, etc.), as non-intrusive and ecological observation

Page 5:

2. Wearable videos
• Video acquisition setup

• Wide angle camera on shoulder

• Non intrusive and easy to use device

• IADL capture: from 40 minutes up to 2.5 hours

• Natural integration into the protocol of home visits by paramedical assistants

Loxie – ear-worn camera; glasses worn with eye-tracker (EyeBrain?)

Page 6:

Wearable videos
• 4 examples of activities recorded with this camera:
• Making the bed, Washing dishes, Sweeping, Hoovering

Page 7:

4. Visual attention/Saliency Maps from wearable video application to recognition of manipulated objects

• Introduction
• State of the art
• Saliency modeling
• Viewpoint: Actor vs. Observer
• Object recognition in egocentric videos with saliency
• Results
• Conclusion

Page 8:

INTRODUCTION
• Object recognition (Dem@care IP FP7 EU funded, 7 M€)

• From wearable camera

• Egocentric viewpoint

• Manipulated objects from activities of daily living


Page 9:

WINDOW SEARCH: A COMMON APPROACH

Objectness [1][2]

Measure to quantify how likely it is for an image window to contain an object.

[1] Alexe, B., Deselaers, T. and Ferrari, V. What is an object? CVPR 2010.
[2] Alexe, B., Deselaers, T. and Ferrari, V. Measuring the objectness of image windows. PAMI 2012.
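As an illustration of the window-search idea, here is a minimal Python sketch of a sliding-window scan that scores every window with an objectness-style measure. `window_search` and `objectness_fn` are hypothetical names; the scoring function itself (e.g. the measure of Alexe et al.) is assumed to be supplied by the caller.

```python
def window_search(image, objectness_fn, win=96, stride=32, top_k=10):
    """Slide a fixed-size window over the image, score each window with an
    objectness-style measure (how likely the window is to contain an object)
    and return the best-scoring candidates."""
    h, w = image.shape[:2]
    scored = []
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            score = objectness_fn(image[y:y + win, x:x + win])
            scored.append((score, (x, y, win, win)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```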

Page 10:

OBJECT RECOGNITION WITH SALIENCY
• Many objects may be present in the camera field

• How to consider the object of interest?

• Our proposal: By using visual saliency

IMMED DB

Page 11:

OUR APPROACH: MODELING VISUAL ATTENTION

• Several approaches
• Bottom-up or top-down
• Overt or covert attention
• Spatial or spatio-temporal
• Scanpath or pixel-based saliency
• Features
• Intensity, color, and orientation (Feature Integration Theory [1]), HSI or L*a*b* color space
• Relative motion [2]
• Plenty of models in the literature
• In their 2012 survey, A. Borji and L. Itti [3] inventoried 48 significant visual attention methods
[1] Anne M. Treisman & Garry Gelade. A feature-integration theory of attention. Cognitive Psychology, vol. 12, no. 1, pages 97–136, January 1980.
[2] Scott J. Daly. Engineering Observations from Spatiovelocity and Spatiotemporal Visual Models. In IS&T/SPIE Conference on Human Vision and Electronic Imaging III, volume 3299, pages 180–191, 1998.
[3] Ali Borji & Laurent Itti. State-of-the-art in Visual Attention Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 99, no. PrePrints, 2012.

Page 12:

STATE OF THE ART: Saliency-based approach

- [1] performed a recent comparison of action recognition performances using Saliency-based modification of the BOW framework on Hollywood2 dataset (videos extracted from movies)

- Results outperform the previous state of the art with a 61.9% action recognition rate (previously 58.3%)

[1] E. Vig, M. Dorr, and D. Cox. Space-variant descriptor sampling for action recognition based on saliency and eye movements. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, editors, Computer Vision ECCV 2012

Fig. 1. Six different saliency masks, from left to right, top row: unmasked, center mask, empirical saliency mask (eye-tracker); bottom row: analytical saliency mask, peripheral mask, offset empirical mask

Page 13:

"OBJECTIVE"/AUTOMATIC SALIENCY FROM VIDEO: ITTI'S MODEL

• The most widely used model

• Designed for still images

• Does not consider the temporal dimension of videos

[1] Itti, L.; Koch, C.; Niebur, E., "A model of saliency-based visual attention for rapid scene analysis," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 20, no. 11, pp. 1254–1259, Nov 1998.
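As a rough illustration only, a greatly simplified, Itti-style center-surround computation on the intensity channel is sketched below; the full model of [1] also uses color opponency and orientation channels over a multi-scale pyramid, so this is an assumption-laden stand-in, not the published algorithm.

```python
import numpy as np
import cv2

def itti_like_saliency(frame_bgr):
    """Very rough Itti-style static saliency: center-surround contrast of the
    intensity channel at two Gaussian scales, normalized to [0, 1]."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    center = cv2.GaussianBlur(gray, (0, 0), 2.0)      # fine scale
    surround = cv2.GaussianBlur(gray, (0, 0), 8.0)    # coarse scale
    contrast = np.abs(center - surround)              # center-surround difference
    return contrast / (contrast.max() + 1e-8)
```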

Page 14:

SPATIOTEMPORAL SALIENCY MODELING

[Diagram: video frames → spatial filtering and normalization → spatial saliency map; video frames → global motion estimation (GME) → temporal filtering → temporal saliency map; fusion → spatio-temporal saliency map]

• Most spatio-temporal bottom-up methods work in the same way [1], [2]

• Extraction of the spatial saliency map (static pathway)

• Extraction of the temporal saliency map (dynamic pathway)

• Fusion of the spatial and the temporal saliency maps (fusion)

[1] Olivier Le Meur, Patrick Le Callet & Dominique Barba. Predicting visual fixations on video based on low-level visual features. Vision Research, vol. 47, no. 19, pages 2483–2498, Sep 2007.
[2] Sophie Marat, Tien Ho Phuoc, Lionel Granjon, Nathalie Guyader, Denis Pellerin & Anne Guérin-Dugué. Modelling spatio-temporal saliency to predict gaze direction for short videos. International Journal of Computer Vision, vol. 82, no. 3, pages 231–243, 2009.
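The two-pathway scheme above can be summarized by a short sketch; `spatial_fn` and `temporal_fn` are placeholders for the static and dynamic pathway models detailed on the next slides, and the multiplicative pooling is just one of several possible fusion choices.

```python
def spatio_temporal_saliency(prev_gray, curr_gray, frame_bgr,
                             spatial_fn, temporal_fn):
    """Generic two-pathway scheme: static (spatial) pathway, dynamic
    (temporal) pathway after global motion estimation, then fusion."""
    s_spatial = spatial_fn(frame_bgr)                 # static pathway
    s_temporal = temporal_fn(prev_gray, curr_gray)    # dynamic pathway
    fused = s_spatial * s_temporal                    # simple multiplicative fusion
    return fused / (fused.max() + 1e-8)
```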

Page 15:

SPATIAL SALIENCY MODEL

• Based on the sum of 7 color contrast descriptors in the HSI domain [1][2]:
• Saturation contrast
• Intensity contrast
• Hue contrast
• Opposite color contrast
• Warm and cold color contrast
• Dominance of warm colors
• Dominance of brightness and hue

The 7 descriptors are computed for each pixel of a frame I using the 8-connected neighborhood.

The spatial saliency map is computed by summing the 7 descriptor maps.

Finally, the spatial saliency map is normalized between 0 and 1 according to its maximum value.

[1] M.Z. Aziz & B. Mertsching. Fast and Robust Generation of Feature Maps for Region-Based Visual Attention. Image Processing, IEEE Transactions on, vol. 17, no. 5, pages 633 –644, may 2008.[2] Olivier Brouard, Vincent Ricordel & Dominique Barba. Cartes de Saillance Spatio-Temporelle basées Contrastes de Couleur et Mouvement Relatif. In Compression et representation des signaux audiovisuels, CORESA 2009, page 6 pages, Toulouse, France, March 2009.
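A minimal sketch of such a contrast-based spatial map is given below. It computes a single local color-contrast measure over the 8-connected neighborhood in HSV (an assumption for simplicity; the cited works use the seven specific descriptors in HSI), then normalizes by the maximum, as stated on the slide.

```python
import numpy as np
import cv2

def spatial_saliency(frame_bgr):
    """Toy contrast-based spatial saliency: per-pixel absolute difference to
    the mean of the 8-connected neighborhood, summed over the HSV channels
    and normalized between 0 and 1 by the maximum value."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    kernel = np.ones((3, 3), np.float32)
    kernel[1, 1] = 0.0                        # exclude the central pixel
    saliency = np.zeros(hsv.shape[:2], np.float32)
    for c in range(3):
        channel = hsv[:, :, c]
        neighbor_mean = cv2.filter2D(channel, -1, kernel / 8.0)
        saliency += np.abs(channel - neighbor_mean)
    return saliency / (saliency.max() + 1e-8)
```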

Page 16:

TEMPORAL SALIENCY MODEL

The temporal saliency map is extracted in 4 steps [Daly 98][Brouard et al. 09][Marat et al. 09]:
1. The optical flow is computed for each pixel of frame i.
2. The motion is accumulated and the global motion is estimated.
3. The residual motion is computed as the difference between the optical flow and the estimated global motion.
4. The temporal saliency map is computed by filtering the amount of residual motion in the frame.
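A hedged sketch of these four steps using OpenCV follows; the dense Farnebäck flow, the affine model for global motion and the Gaussian temporal filtering are stand-ins chosen for illustration, not necessarily the estimators used in the cited works.

```python
import numpy as np
import cv2

def temporal_saliency(prev_gray, curr_gray, step=16):
    """Toy residual-motion saliency: dense optical flow, global (camera)
    motion fitted as an affine model, residual = flow - global motion,
    then smoothed and normalized."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1)          # (N, 2) pixel grid
    dst = pts + flow.reshape(-1, 2)
    # global motion estimation (GME): robust affine fit on a subsample
    A, _ = cv2.estimateAffine2D(pts[::step].reshape(-1, 1, 2),
                                dst[::step].reshape(-1, 1, 2),
                                method=cv2.RANSAC)
    if A is None:                                             # fallback: no global motion
        A = np.hstack([np.eye(2), np.zeros((2, 1))])
    global_motion = (pts @ A[:, :2].T + A[:, 2]) - pts
    residual = flow.reshape(-1, 2) - global_motion
    mag = np.linalg.norm(residual, axis=1).reshape(h, w).astype(np.float32)
    sal = cv2.GaussianBlur(mag, (9, 9), 2.0)                  # temporal filtering
    return sal / (sal.max() + 1e-8)
```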

Page 17:

SALIENCY MODEL IMPROVEMENT

[Diagram: three pathways — spatial filtering and normalization → spatial saliency map; global motion estimation (GME) → temporal filtering → temporal saliency map; GME → camera center motion estimation → geometric saliency map; fusion → spatio-temporal saliency map]

• Spatio-temporal saliency models were designed for edited videos
• Not well suited for unedited egocentric video streams
• Our proposal: add a geometric saliency cue that takes camera motion anticipation into account

1. H. Boujut, J. Benois-Pineau, and R. Megret. Fusion of multiple visual cues for visual saliency extraction from wearable camera settings with strong motion. In A. Fusiello, V. Murino, and R. Cucchiara, editors, Computer Vision ECCV 2012, IFCV WS

Page 18:

GEOMETRIC SALIENCY MODEL

• A 2D Gaussian was already applied in the literature [1]
• "Center bias", Buswell, 1935 [2]
• Suitable for edited videos
• Our proposal:
• Train the center position as a function of camera position
• Move the 2D Gaussian center according to camera center motion
• Computed from the global motion estimation
• Considers the anticipation phenomenon [Land et al.]

Geometric saliency map
[1] Tilke Judd, Krista A. Ehinger, Frédo Durand & Antonio Torralba. Learning to predict where humans look. In ICCV, pages 2106–2113. IEEE, 2009.
[2] Michael Dorr, et al. Variability of eye movements when viewing dynamic natural scenes. Journal of Vision (2010), 10(10):28, 1–17.

Page 19:

GEOMETRIC SALIENCY MODEL

• The saliency peak is never located on the visible part of the shoulder

• Most of the saliency peaks are located in the top two-thirds of the frame

• So the 2D Gaussian center is set at:

Saliency peaks on frames from all videos of the eye-tracker experiment
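A toy version of this geometric cue is sketched below: a 2D Gaussian whose center starts in the upper part of the frame (here placed arbitrarily at one third of the height, since the exact formula on the slide is not reproduced) and is displaced by the camera-center motion obtained from global motion estimation.

```python
import numpy as np

def geometric_saliency(h, w, cam_dx=0.0, cam_dy=0.0, sigma_frac=0.25):
    """Toy geometric saliency: 2D Gaussian centered in the upper part of the
    frame and shifted by the camera-center displacement (cam_dx, cam_dy)
    to model anticipation of camera motion."""
    cx = w / 2.0 + cam_dx
    cy = h / 3.0 + cam_dy          # assumption: center at one third of the height
    sigma = sigma_frac * min(h, w)
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    gauss = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return gauss / gauss.max()
```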

Page 20:

SALIENCY FUSION

Several fusion methods for pooling spatio-temporal saliency cues already exist in the literature (without geometric saliency) [1], [2].

We have tested three fusion methods on wearable video databases:
• Log sum fusion
• Squared sum fusion (only on GTEA)
• Multiplicative fusion

[1] Sophie Marat, Tien Ho Phuoc, Lionel Granjon, Nathalie Guyader, Denis Pellerin & Anne Guérin-Dugué. Modelling spatio-temporal saliency to predict gaze direction for short videos. International Journal of Computer Vision, vol. 82, no. 3, pages 231–243, 2009. Département Images et Signal.[2] H. Boujut, J. Benois-Pineau, T. Ahmed, O. Hadar, and P. Bonnet, "A Metric For No Reference video quality assessment for HD TV Delivery based on Saliency Maps," ICME 2011, Workshop on Hot Topics in Multimedia Delivery, Jul. 2011.
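The three tested schemes could look roughly as follows; the exact weights and normalizations used in [1], [2] and in the experiments may differ, so this is an assumption-labeled sketch rather than the published formulas.

```python
import numpy as np

def fuse_saliency(s_spatial, s_temporal, s_geometric, method="log"):
    """Pooling of the three cues with log-sum, squared-sum or multiplicative
    fusion, followed by normalization to [0, 1]."""
    if method == "log":
        fused = np.log1p(s_spatial) + np.log1p(s_temporal) + np.log1p(s_geometric)
    elif method == "squared":
        fused = s_spatial ** 2 + s_temporal ** 2 + s_geometric ** 2
    else:                                    # multiplicative fusion
        fused = s_spatial * s_temporal * s_geometric
    return fused / (fused.max() + 1e-8)
```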

Page 21:

SALIENCY FUSION

[Figure: frame, spatio-temporal-geometric saliency map, and subjective saliency map]

Page 22:

VISUAL ATTENTION MAPS: SUBJECTIVE SALIENCY

D. S. Wooding's method, 2002 (tested with over 5,000 participants):

Eye fixations from the eye-tracker + 2D Gaussians (fovea area = 2° spread) → subjective saliency map
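A Wooding-style subjective map can be sketched as a sum of fixation-centered Gaussians; `sigma_px`, the pixel equivalent of the 2° foveal spread, depends on the viewing geometry and is assumed to be supplied by the caller.

```python
import numpy as np

def wooding_map(fixations_xy, h, w, sigma_px):
    """Subjective saliency in the spirit of Wooding: one 2D Gaussian per eye
    fixation (spread ~ 2 degrees of visual angle, in pixels), summed over all
    fixations and normalized by the maximum."""
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    smap = np.zeros((h, w), np.float32)
    for fx, fy in fixations_xy:
        smap += np.exp(-((xs - fx) ** 2 + (ys - fy) ** 2) / (2.0 * sigma_px ** 2))
    return smap / (smap.max() + 1e-8)
```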

Page 23:

SUBJECTIVE SALIENCY

Page 24:

HOW DO PEOPLE (OBSERVERS) WATCH VIDEOS FROM A WEARABLE CAMERA?

• Psycho-visual experiment

• Gaze measure with an Eye-Tracker (Cambridge Research Systems Ltd. HS VET 250Hz)

• 31 HD video sequences from the IMMED database
• Duration 13'30''
• 25 subjects (5 discarded)
• 6,562,500 gaze positions recorded
• We noticed that subjects anticipate camera motion

Page 25:

EVALUATION ON IMMED DB

Metric: Normalized Scanpath Saliency (NSS) correlation method

Comparison of:
• Baseline spatio-temporal saliency
• Spatio-temporal-geometric saliency without camera motion
• Spatio-temporal-geometric saliency with camera motion

Results:
• Up to 50% better than spatio-temporal saliency
• Up to 40% better than spatio-temporal-geometric saliency without camera motion

H. Boujut, J. Benois-Pineau, R. Megret: "Fusion of Multiple Visual Cues for Visual Saliency Extraction from Wearable Camera Settings with Strong Motion". ECCV Workshops (3) 2012: 436–445
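For reference, the NSS score used for such comparisons is typically computed as below (standard definition; the exact evaluation protocol of the paper may add details such as map resizing or fixation filtering).

```python
import numpy as np

def nss(saliency_map, fixations_xy):
    """Normalized Scanpath Saliency: z-score the predicted map and average
    its values at the recorded gaze positions."""
    z = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    return float(np.mean([z[int(y), int(x)] for x, y in fixations_xy]))
```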

Page 26:

EVALUATION ON GTEA DB
• IADL dataset
• 8 videos, duration 24'43''
• Eye-tracking measures for actors and observers
• Actors (8 subjects)
• Observers (31 subjects); 15 subjects have seen each video
• SD resolution video at 15 fps recorded with eye-tracker glasses

A. Fathi, Y. Li, and J. Rehg. Learning to recognize daily actions using gaze. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, editors, Computer Vision ECCV 2012

Page 27:

EVALUATION ON GTEA DB
• "Center bias": camera on glasses, head movement compensates
• Correlation is database dependent

Objective vs. subjective (viewer)

Page 28:

VISUAL ATTENTION MAPS IN ACTIONS: ACTOR VS. OBSERVER

[Figure, GTEA dataset: actor vs. observer saliency maps]

Gaze focused on next action

Page 29:

VIEWPOINT: ACTOR VS. OBSERVER (CONT.)

Tea-making: average relative timings of body movements, object fixation, and object manipulation [1]

Time relationships of vision and motor acts [1][2]

[1] Land, M. F., Mennie, N., & Rusted, J. (1999). The roles of vision and eye movements in the control of activities of daily living. Perception, 28, 1311–1328.[2] C. Prablanc, J. Echailler, E. Komilis, and M. Jeannerod. Optimal response of eye and hand motor systems in pointing at a visual target. Biol. Cybernetics, 35:113–124, 1979.

Page 30:

Distributions of quality metrics

Page 31:

Page 32:

Page 33:

VIEWPOINT: ACTOR VS. OBSERVER (cont.)

[Plot: mean NSS, AUC, and PCC vs. time shift (frames)]
Actor saliency correlation with viewer saliency (GTEA database / 8 videos at 15 fps / 31 subjects)

Wilcoxon test: OK
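A plausible sketch of how such a curve can be produced: shift the observers' fixations by a growing number of frames with respect to the actor's saliency maps and average an NSS-like score per shift. Function and variable names are illustrative, and the actual evaluation also uses AUC and PCC.

```python
import numpy as np

def correlation_vs_shift(actor_maps, observer_fixations, max_shift=55):
    """Mean NSS between the actor's saliency map at frame t and the
    observer's fixations at frame t + shift, for each candidate shift;
    the shift maximizing the curve estimates the actor/observer delay."""
    def nss(smap, fixations):
        z = (smap - smap.mean()) / (smap.std() + 1e-8)
        return np.mean([z[int(y), int(x)] for x, y in fixations])

    curve = []
    for shift in range(max_shift + 1):
        scores = [nss(actor_maps[t], observer_fixations[t + shift])
                  for t in range(len(actor_maps) - shift)
                  if len(observer_fixations[t + shift]) > 0]
        curve.append(float(np.mean(scores)) if scores else float("nan"))
    return curve
```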

Page 34:

Conclusion 1

A time shift is identified in wearable video between the actor's and the observer's point of view at the beginning of actions.

The value is around 500 ms; this confirms the findings by M. Land (1999) and C. Prablanc (1979) for elementary grasping actions.

From wearable video we can confirm the known bio-physical fact: the actor anticipates the grasping action, foveating the object of interest before the action.

Page 35:

Conclusion: actor vs. observer (1)

- The observer is focused on an action; his visual attention is delayed in time at the beginning of an action.

- The visual attention maps of actors vs. observers, NC vs. AD, NC vs. PD subjects can be compared in order to identify and quantify the deviation.

- The NC observers' visual attention maps (subjective saliency) can be reasonably predicted by existing signal-based models. Therefore, only measurement of the actor's visual attention map would be necessary.

Page 36:

OBJECT RECOGNITION WITH PREDICTED SALIENCY MAPS

[Pipeline: mask computation → local patch detection & description → BoW computation (visual vocabulary) → image matching (image retrieval) / supervised classifier (object recognition)]

Spatially constrained approach using saliency methods
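One plausible way to plug the saliency masks into the BoW pipeline is to weight each local descriptor's vote by the saliency value at its keypoint, as sketched below; `saliency_weighted_bow` is a hypothetical name and the exact weighting or masking scheme used in the experiments may differ.

```python
import numpy as np

def saliency_weighted_bow(descriptors, keypoints_xy, saliency_map, vocabulary):
    """Spatially constrained Bag-of-Words: each descriptor votes for its
    nearest visual word, weighted by the saliency at its keypoint location,
    and the histogram is L2-normalized."""
    hist = np.zeros(len(vocabulary), dtype=np.float32)
    for desc, (x, y) in zip(descriptors, keypoints_xy):
        word = int(np.argmin(np.linalg.norm(vocabulary - desc, axis=1)))
        hist[word] += saliency_map[int(y), int(x)]      # saliency-weighted vote
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```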

Page 37:

GTEA DATASET
• GTEA [1] is an egocentric video dataset containing 7 types of daily activities performed by 14 different subjects.
• The camera is mounted on a cap worn by the subject.
• Scenes show 15 object categories of interest.
• Data split into a training set (294 frames) and a test set (300 frames).

Categories in GTEA dataset with the number of positives in train/test sets

[1] Alireza Fathi, Yin Li, James M. Rehg, Learning to recognize daily actions using gaze, ECCV 2012.

Page 38:

ASSESSMENT OF VISUAL SALIENCY IN OBJECT RECOGNITION

We tested various parameters of the model:
• Various approaches for local region sampling:
1. Sparse detectors (SIFT, SURF)
2. Dense detectors (grids at different granularities)
• Different options for the spatial constraints:
1. Without masks: global BoW
2. Ideal manually annotated masks
3. Saliency masks: geometric, spatial, fusion schemes…
• In two tasks:
1. Object retrieval (image matching): mAP ~ 0.45
2. Object recognition (learning): mAP ~ 0.91

Page 39:

OBJECT RECOGNITION WITH SALIENCY MAPS

Page 40:

OBJECT RECOGNITION WITH SALIENCY MAPS

The best: Ideal, Geometric, Squared-with-geometric; Actors' is low

Page 41:

CONCLUSION 2
• The proposed automatic/"objective" observer saliency maps are as good as subjective visual attention maps of observers in the tasks of automatic object recognition
• Time-shifted gaze correlation between actors and observers at the beginning of actions
• Perspective: study of "normal" and "abnormal" saliency maps in video for patients with various neurodegenerative diseases
• Automatic prediction of "normal" saliency maps

Page 42:

Perspectives

Fusion of multiple media cues: video, audio, accelerometers, gyroscopes (Parkinson's): STIC regional project submitted, TECSAN submitted, UBx1 project granted (problem of recognizing the context of falls in Parkinson's patients)

Visual saliency: study of "normal" and "abnormal" saliency maps in video for patients with various neurodegenerative diseases.

Automatic prediction of "normal" saliency maps

Page 43:

Acknowledgments
PU-PH Jean-François Dartigues, INSERM
Em. Pr. Dominique Barba, IRCCyN/University of Nantes
Pr. Mark L. Latash, PSU (USA)