Pr. Jenny Benois-Pineau,
LaBRI – Université de Bordeaux – CNRS UMR 5800 / University of Bordeaux / IPB-ENSEIRB
H. Boujut, V. Buso, L. Letoupin, R. Megret, S. Karaman, V. Dovgalecs (UBx), I. Gonzalez-Diaz (UHCM)
Y. Gaestel, J.-F. Dartigues (INSERM)
Summary
1. Introduction and motivation
2. Egocentric video
3. Fusion of multiple cues in H-HMM for IADL recognition
4. Object recognition with saliency maps from egocentric video
5. Conclusion and perspectives
1. Introduction and motivation
- Egocentric vision: an active field mainly since the 2000s
- Objective assessment of capacities in instrumental activities of daily living (IADL) of persons with dementia
- Hence, recognition of activities and manipulated objects in egocentric video
- Multidisciplinary projects: IT/multimedia research and medical research
  - ANR IMMED https://immed.labri.fr/
  - IP FP7 EU Dem@care http://www.demcare.eu/
2. Egocentric Video
• Video acquisition setup
• Wide-angle camera on the shoulder
• Non-intrusive and easy-to-use device
• IADL capture: from 40 minutes up to 2.5 hours
Egocentric Video
• Four examples of activities recorded with this camera:
• Making the bed, washing dishes, sweeping, hoovering (vacuuming)
3. Fusion of multiple cues in H-HMM for IADL Recognition
3.1 General architecture
[Architecture diagram: video data feeds motion estimation, temporal partitioning, motion descriptors, CLD and localisation (with place annotation); audio data feeds audio descriptors; all descriptors are fused and passed to an H-HMM that outputs activities.]
3.2 Temporal Partitioning (1)
• Pre-processing: preliminary step towards activity recognition
• Objectives:
  • Reduce the gap between the amount of data (frames) and the target number of detections (activities)
  • Associate one observation to one viewpoint
• Principle:
  • Use the global motion, i.e. ego-motion, to segment the video in terms of viewpoints
  • One key-frame per segment: its temporal center
  • Rough indexes for navigation throughout this long sequence shot
  • Automatic video summary of each new video footage
Temporal Partitioning (2)
• Complete affine model of global motion (a1, a2, a3, a4, a5, a6)
• Principle:
  • Trajectories of corners from the global motion model
  • End of segment when at least 3 corner trajectories have reached outbound positions
[Krämer et al.] Camera Motion Detection in the Rough Indexing Paradigm, TREC'2005.
dxᵢ = a1 + a2·xᵢ + a3·yᵢ
dyᵢ = a4 + a5·xᵢ + a6·yᵢ
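The segmentation rule above (corner trajectories propagated through the global affine model, cut when at least 3 of the 4 image corners go out of bounds) can be sketched as follows; the function names and the threshold convention t = p × w are illustrative, not the authors' exact implementation:

```python
def warp(p, a):
    """Apply the complete affine model (a1..a6) to point p = (x, y)."""
    x, y = p
    return (a[0] + a[1] * x + a[2] * y,
            a[3] + a[4] * x + a[5] * y)

def segment_video(affine_per_frame, width, height, p=0.25):
    """Cut a new segment when >= 3 of the 4 tracked image corners
    have drifted farther than t = p * width from their start position."""
    t = p * width
    cuts = []
    corners = start = [(0.0, 0.0), (width, 0.0), (0.0, height), (width, height)]
    for i, a in enumerate(affine_per_frame):
        corners = [warp(c, a) for c in corners]
        out = sum(1 for c, s in zip(corners, start)
                  if abs(c[0] - s[0]) > t or abs(c[1] - s[1]) > t)
        if out >= 3:
            cuts.append(i)      # end of segment at frame i
            corners = start     # restart the corner trajectories
    return cuts
```

With a static camera (identity model) no cut is ever produced; a steady horizontal pan produces regular cuts as the corners drift out of bounds.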
Temporal Partitioning (3)
• Threshold t defined as a percentage p of the image width w: t = p × w, with p = 0.2 … 0.25
Temporal Partitioning (4): Video Summary
• 332 key-frames from 17,772 initial frames
• Video summary (6 fps)
3.3 Description space: fusion of features (1)
• Color: MPEG-7 Color Layout Descriptor (CLD)
  • 6 coefficients for luminance, 3 for each chrominance
  • For a segment: CLD of the key-frame, x(CLD) ∈ ℝ¹²
• Localization: feature vector adaptable to the individual home environment
  • N_home localizations, x(Loc) ∈ ℝ^N_home
  • Localization estimated for each frame
  • For a segment: mean vector over the frames within the segment
• Audio: x(Audio): probabilistic features SMN…
V. Dovgalecs, R. Mégret, H. Wannous, Y. Berthoumieu. "Semi-Supervised Learning for Location Recognition from Wearable Video". CBMI'2010, France.
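Per-segment feature assembly as described above can be sketched like this; a minimal sketch where the helper name and argument layout are hypothetical:

```python
def segment_descriptor(cld_keyframe, loc_per_frame, audio_feats):
    """One observation per segment: the key-frame's 12-D CLD vector,
    the mean localization vector over the segment's frames,
    and the segment's audio confidence features."""
    n, dim = len(loc_per_frame), len(loc_per_frame[0])
    loc_mean = [sum(f[d] for f in loc_per_frame) / n for d in range(dim)]
    return list(cld_keyframe) + loc_mean + list(audio_feats)
```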
Description space (2). Audio
[Audio analysis diagram (J. Pinquier, IRIT): the audio stream is segmented; speech detection (4 Hz energy modulation, entropy modulation), music detection (segment duration, number of segments), silence detection (energy) and noise detection (MFCC with GMM, spectral cover) produce speech, music, silence, water-flow/vacuum-cleaner and other-noise probabilities; score-level fusion yields 7 audio confidence indicators.]
Description space (3). Motion
• Htpe: log-scale histogram of the translation parameters' energy. Characterizes the global motion strength and aims to distinguish activities with strong or low motion
• Ne = 5, sh = 0.2. Feature vectors x(Htpe,a1) and x(Htpe,a4) ∈ ℝ⁵
• Histograms are averaged over all frames within the segment

| Segment | x(Htpe,a1) | x(Htpe,a4) |
|---|---|---|
| Low motion | 0.87, 0.03, 0.02, 0, 0.08 | 0.93, 0.01, 0.01, 0, 0.05 |
| Strong motion | 0.05, 0, 0.01, 0.11, 0.83 | 0, 0, 0, 0.06, 0.94 |
With the energy e(a) = log₂(1 + a²) of a translation parameter a, bin size sh and Ne bins:
Htpe[i] += 1  if e(a) < i·sh,               for i = 1
Htpe[i] += 1  if (i−1)·sh ≤ e(a) < i·sh,    for i = 2 … Ne−1
Htpe[i] += 1  if e(a) ≥ (i−1)·sh,           for i = Ne
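A sketch of the Htpe computation, assuming the energy e(a) = log2(1 + a²) and the binning reconstructed above; the exact energy definition in the original slides may differ:

```python
import math

def htpe(translation_params, n_e=5, s_h=0.2):
    """Log-scale histogram of translation-parameter energy over a
    segment's frames, normalized so the bins sum to 1."""
    h = [0.0] * n_e
    for a in translation_params:
        e = math.log2(1.0 + a * a)       # assumed energy measure
        i = min(int(e / s_h), n_e - 1)   # last bin collects large energies
        h[i] += 1.0
    return [v / len(translation_params) for v in h]
```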
Description space (4). Motion
• Hc: cut histogram. The i-th bin contains the number of temporal segmentation cuts in the 2^i last frames
  Example: Hc[1]=0, Hc[2]=0, Hc[3]=1, Hc[4]=1, Hc[5]=2, Hc[6]=7
• Averaged over all frames within the segment
• Characterizes the motion history: the strength of motion even outside the current segment
  2⁶ = 64 frames → 2 s, 2⁸ = 256 frames → 8.5 s; x(Hc) ∈ ℝ⁶ or ℝ⁸
• Residual motion on a 4×4 grid: x(RM) ∈ ℝ¹⁶
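The cut histogram can be sketched as below (0-based bins here: bin i counts the cuts within the last 2^(i+1) frames, matching the slide's Hc[1]…Hc[6]):

```python
def cut_histogram(cut_frames, current_frame, n_bins=6):
    """H_c: bin i counts the temporal-segmentation cuts that fall
    within the last 2**(i+1) frames before current_frame."""
    return [sum(1 for c in cut_frames
                if current_frame - 2 ** (i + 1) < c <= current_frame)
            for i in range(n_bins)]
```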
Description space (5). Place recognition
Granularity of visual recognition:
• Room recognition [Dovgalecs 2010]
• Topological positioning [O'Conaire 2009]
• Metric 3D positioning [Wannous 2012]
Trade-off: metric 3D positioning offers precision and specificity but requires heavy modeling; room-level recognition offers availability and sensitivity with lightweight modeling.
V. Dovgalecs et al. "Semi-Supervised Learning for Location Recognition from Wearable Video". CBMI 2010.
O'Conaire et al. SenseCam Image Localisation using Hierarchical SURF Trees. MMM'2009.
H. Wannous et al. Place Recognition via 3D Modeling for Personal Activity Lifelog using Wearable Camera. MMM'2012.
UB1 - Wearable camera video analysis
Evaluation of room recognition on the IMMED dataset @Home. Average accuracy (temporal accumulation without / with):

| Algorithm | Feature space | without | with |
|---|---|---|---|
| SVM | BOVW | 0.49 | 0.52 |
| SVM | CRFH | 0.48 | 0.53 |
| SVM | SPH | 0.47 | 0.49 |
| Early fusion | BOVW+SPH | 0.48 | 0.50 |
| Early fusion | BOVW+CRFH | 0.50 | 0.54 |
| Early fusion | CRFH+SPH | 0.48 | 0.51 |
| Late fusion | BOVW+SPH | 0.51 | 0.56 |
| Late fusion | BOVW+CRFH | 0.51 | 0.56 |
| Late fusion | CRFH+SPH | 0.50 | 0.54 |
| Co-training with late fusion | BOVW+SPH | 0.50 | 0.53 |
| Co-training with late fusion | BOVW+CRFH | 0.54 | 0.58 |
| Co-training with late fusion | CRFH+SPH | 0.54 | 0.57 |

A difficult dataset due to the small amount of training data for each location (5-minute bootstrap).
Conclusion: the best performance is obtained using late-fusion approaches with temporal accumulation.
V. Dovgalecs, R. Megret, and Y. Berthoumieu. Multiple feature fusion based on Co-Training approach and time regularization for place classification in wearable video. Advances in Multimedia. In press.
Early fusion of all features
[Diagram: dynamic, static and audio features are combined into one description vector and fed to an HMM classifier that maps the media to activities.]
Description space (6)
• Number of possible combinations of descriptors: 2⁶ − 1 = 63

| Descriptor | Audio | Loc | RM | Htpe | Hc | CLD | config_min | config_max |
|---|---|---|---|---|---|---|---|---|
| Dimensions | 7 | 7 | 16 | 10 | 8 | 12 | 7 | 60 |
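The 63 descriptor combinations, and the config_min / config_max dimensions in the table, can be enumerated directly:

```python
from itertools import combinations

# Per-descriptor dimensions from the table above
DIMS = {"Audio": 7, "Loc": 7, "RM": 16, "Htpe": 10, "Hc": 8, "CLD": 12}

def all_configs():
    """Every non-empty subset of the 6 descriptors with its total dimension."""
    names = list(DIMS)
    return [(subset, sum(DIMS[n] for n in subset))
            for k in range(1, len(names) + 1)
            for subset in combinations(names, k)]
```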
3.4 Model of the content
HMMs are efficient for classification with temporal causality. An activity is complex and can hardly be modeled by one single state; hence a hierarchical HMM [Fine98], [Bui04].
• Multiple levels
• Computational cost / learning
• Q^d = {q_i^d}: set of states at level d
• Π^{q_i^d}(q_j^{d+1}) = initial probability of child q_j^{d+1} of state q_i^d
• A_ij^{q^d} = transition probabilities between children of q^d
Model of the content: activity recognition
A two-level hierarchical HMM:
• Higher level: transitions between activities
  • Example activities: washing the dishes, hoovering, making coffee, making tea…
• Bottom level: activity description
  • Activity: HMM with 3/5/7/8 states
  • Observation model: GMM
  • Prior probability of activity
Activity recognition: bottom-level HMM
• Start/End → non-emitting states
• Observation x only for emitting states q_i
• Transition probabilities and GMM parameters are learnt by the Baum-Welch algorithm
• A priori fixed number of states
• HMM initialization:
  • Strong self-loop probability a_ii
  • Weak exit probability a_i,end
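The initialization described above (strong self-loop, weak exit) might look like this for a left-right bottom-level HMM; the left-right topology and the numeric values are illustrative assumptions, not taken from the slides:

```python
def init_transitions(n_states, a_loop=0.8, a_end=0.01):
    """Left-right transition matrix: strong self-loop a_ii, weak exit
    probability a_i,end; the remainder goes to the next state."""
    # Columns: n_states emitting states + one final non-emitting End state
    A = [[0.0] * (n_states + 1) for _ in range(n_states)]
    for i in range(n_states):
        A[i][i] = a_loop
        A[i][n_states] = a_end
        if i + 1 < n_states:
            A[i][i + 1] = 1.0 - a_loop - a_end
        else:
            A[i][n_states] += 1.0 - a_loop - a_end  # last state exits only
    return A
```

Baum-Welch re-estimation then refines these probabilities from training data.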
3.5 Results: Video Corpus

| Corpus | Healthy volunteers / patients | Number of videos | Duration |
|---|---|---|---|
| IMMED | 12 healthy volunteers | 15 | 7h16 |
| IMMED | 42 patients | 46 | 17h04 |
| TOTAL | 12 healthy volunteers + 42 patients | 61 | 24h20 |
Evaluation protocol
• precision = TP / (TP + FP)
• recall = TP / (TP + FN)
• accuracy = (TP + TN) / (TP + FP + TN + FN)
• F-score = 2 / (1/precision + 1/recall)
• Leave-one-out cross-validation scheme (one video left out)
• Results are averaged
• Training is performed over a sub-sampling of smoothed (10 frames) data
• The label of a segment is derived by majority vote over frame results
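The four measures can be computed directly from the confusion counts:

```python
def scores(tp, fp, tn, fn):
    """Precision, recall, accuracy and F-score from the confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2.0 / (1.0 / precision + 1.0 / recall)  # harmonic mean
    return precision, recall, accuracy, f_score
```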
Recognition of activities on 5 videos. Which descriptors?
3-state lower-level HMMs; the "None" class uses 1 state
Comparison with a GMM baseline (23 activities on 26 videos)
« Hoovering - specific audio description »
Conclusion on early fusion
• Early fusion requires testing numerous combinations of features.
• The best results were achieved for the complete description space.
• For specific activities, the optimal combinations of descriptors vary and correspond to a « common sense » approach.
S. Karaman, J. Benois-Pineau, V. Dovgalecs, R. Megret, J. Pinquier, R. André-Obrecht, Y. Gaestel, J.-F. Dartigues, « Hierarchical Hidden Markov Model in detecting activities of daily living in wearable videos for studies of dementia », Multimedia Tools and Applications, Springer, 2012, pp. 1-29, DOI 10.1007/s11042-012-117-x, article in press.
Intermediate fusion
• Treat the different modalities (dynamic, static, audio) separately.
• Each modality is represented by a stream, i.e. a set of measures along time.
• Each state of the HMM models the observations of each stream separately by a Gaussian mixture.
• K streams of observations: x_{1,t}, …, x_{K,t}, with x_{k,t} ∈ ℝ^{d_k}
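Modelling each stream separately amounts to multiplying the per-stream densities, i.e. summing their log-likelihoods; the optional stream weights are an assumption for illustration, not part of the slide:

```python
def state_log_likelihood(stream_loglikes, stream_weights=None):
    """Observation log-likelihood of one HMM state when each of the K
    streams is modelled independently by its own GMM: the (weighted) sum
    of per-stream log-likelihoods, i.e. a product of stream densities."""
    w = stream_weights or [1.0] * len(stream_loglikes)
    return sum(wk * ll for wk, ll in zip(w, stream_loglikes))
```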
Late Fusion
[Diagram: dynamic, static and audio features each feed their own HMM classifier; the classifier scores are then fused to output activities.]
Performance measure of classifiers: modality k, activity l
Experimental video corpus in the 3-fusion experiment
37 videos recorded by 34 persons (healthy volunteers and patients), for a total of 14 hours of content.
Results of the 3-fusion experiment (1)
Results of the 3-fusion strategies (2)
Conclusion on the 3-fusion experiment
Overall, the experiments have shown that intermediate fusion consistently provides better results than the other fusion approaches on such complex data, supporting its use and expansion in future work.
4. Object recognition with saliency maps from wearable video
• Why?
• Saliency modeling
• Object recognition in egocentric videos with saliency
• Results
• Conclusion
INTRODUCTION
• Object recognition
  • From a wearable camera
  • Egocentric viewpoint
  • Manipulated objects from activities of daily living
OBJECT RECOGNITION WITH SALIENCY
• Many objects may be present in the camera field
• How to single out the object of interest, the "active object"?
• Our proposal: by using visual saliency. This is a popular subject!
IMMED DB
VISUAL ATTENTION
• Several approaches
  • Bottom-up or top-down
  • Overt or covert attention
  • Spatial or spatio-temporal
  • Scanpath or pixel-based saliency
• Features
  • Intensity, color, and orientation (Feature Integration Theory [1]), HSI or L*a*b* color space
  • Relative motion [2]
• Plenty of models in the literature
  • In their 2012 survey, A. Borji and L. Itti [3] inventoried 48 significant visual attention methods
[1] A. M. Treisman & G. Gelade. A feature-integration theory of attention. Cognitive Psychology, vol. 12, no. 1, pages 97-136, January 1980.
[2] S. J. Daly. Engineering Observations from Spatiovelocity and Spatiotemporal Visual Models. IS&T/SPIE Conference on Human Vision and Electronic Imaging III, volume 3299, pages 180-191, 1998.
[3] A. Borji & L. Itti. State-of-the-art in Visual Attention Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
SALIENCY MODEL
1. H. Boujut, J. Benois-Pineau, and R. Megret. Fusion of multiple visual cues for visual saliency extraction from wearable camera settings with strong motion. In A. Fusiello, V. Murino, and R. Cucchiara, editors, Computer Vision ECCV 2012, IFCV WS.
2. V. Buso, J. Benois-Pineau, I. Gonzalez-Diaz. « Object recognition in egocentric videos with saliency-based non uniform sampling and variable resolution space for features selection », CVPR'2014, 3rd WS on Egocentric (First-person) Vision.
State of the art: spatial and temporal cues; our contribution: an « egocentric » geometric cue. Fusion by least-squares optimisation w.r.t. the object recognition score:
S(p) = α·Sp(p) + β·St(p) + γ·Sg(p), with α + β + γ = 1
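The fusion equation can be sketched pixel-wise; here the weights are fixed placeholders, whereas the slides obtain them by least-squares optimisation w.r.t. the object recognition score:

```python
def fuse_saliency(sp, st, sg, alpha=0.4, beta=0.4, gamma=0.2):
    """Pixel-wise convex combination S = a*Sp + b*St + g*Sg, a+b+g = 1.
    Maps are given as flat lists of per-pixel saliency values."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    return [alpha * p + beta * t + gamma * g
            for p, t, g in zip(sp, st, sg)]
```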
GEOMETRIC SALIENCY MODEL
• The saliency peak is never located on the visible part of the shoulder
• Most of the saliency peaks are located in the top 2/3 of the frame
• The 2D Gaussian center is set accordingly
(Saliency peaks on frames from all videos of the eye-tracker experiment.)
GEOMETRIC SALIENCY MODEL
• A 2D Gaussian has already been applied in the literature [1]
  • "Center bias", Buswell, 1935 [2]
  • Suitable for edited videos
• Our proposal:
  • Train the center position as a function of camera position
  • Move the 2D Gaussian center according to camera center motion
(Geometric saliency map.)
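A minimal geometric saliency map as a 2D Gaussian whose center sits in the upper part of the frame; the center and sigma ratios are illustrative placeholders (the slides propose training the center from camera motion):

```python
import math

def geometric_saliency(w, h, cx_ratio=0.5, cy_ratio=1 / 3, sigma_ratio=0.25):
    """2D Gaussian 'geometric' saliency map; the center sits in the upper
    part of the frame, where the eye-tracker saliency peaks cluster."""
    cx, cy = cx_ratio * w, cy_ratio * h
    s2 = (sigma_ratio * min(w, h)) ** 2
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * s2))
             for x in range(w)] for y in range(h)]
```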
SALIENCY FUSION
(Frame; spatio-temporal-geometric saliency map; subjective saliency map.)
PROPOSED PROCESSING PIPELINE FOR OBJECT RECOGNITION
[Pipeline: mask computation (spatially constrained approach using saliency maps) → local patch detection & description → BoW computation against a visual vocabulary → image matching / supervised classifier → image retrieval and object recognition.]
Contribution of specific egocentric cues: CHU Nice dataset
CHU Nice dataset: 44 videos, 9 h 30 min, 17 categories of active objects; object occurrences from 102 (tea box) to 2032 (tablet); 22,236 annotated frames.
CONCLUSION
• Activity recognition with fusion of low-level and mid-level features
• Object recognition with saliency models specifically adapted to egocentric video
• Work in progress: new model of activity = objects + location
• Fusion of egocentric view and 3rd-person view for activity recognition
SPATIAL SALIENCY MODEL
• Based on the sum of 7 color contrast descriptors in the HSI domain [1][2]:
  • Saturation contrast
  • Intensity contrast
  • Hue contrast
  • Opposite color contrast
  • Warm and cold color contrast
  • Dominance of warm colors
  • Dominance of brightness and hue
• The 7 descriptors are computed for each pixel of a frame I using the 8-connected neighborhood.
• The spatial saliency map is computed as their sum; finally, it is normalized between 0 and 1 according to its maximum value.
[1] M.Z. Aziz & B. Mertsching. Fast and Robust Generation of Feature Maps for Region-Based Visual Attention. IEEE Transactions on Image Processing, vol. 17, no. 5, pages 633-644, May 2008.
[2] O. Brouard, V. Ricordel & D. Barba. Cartes de Saillance Spatio-Temporelle basées Contrastes de Couleur et Mouvement Relatif. CORESA 2009, Toulouse, France, March 2009.
TEMPORAL SALIENCY MODEL
The temporal saliency map is extracted in 4 steps [Daly 98][Brouard et al. 09][Marat et al. 09]:
1. The optical flow is computed for each pixel of frame i.
2. The motion is accumulated and the global motion is estimated.
3. The residual motion is computed.
4. Finally, the temporal saliency map is computed by filtering the amount of residual motion in the frame.
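Step 3, the residual motion, is the difference between the measured optical flow and the flow predicted by the global (camera) model; a sketch where the parameters (a1..a6) are taken to parameterize the global displacement field directly:

```python
def residual_motion(flow, affine, width, height):
    """Residual motion magnitude per pixel: measured optical flow minus
    the displacement predicted by the global affine (camera) model."""
    res = []
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            gx = affine[0] + affine[1] * x + affine[2] * y  # global dx
            gy = affine[3] + affine[4] * x + affine[5] * y  # global dy
            res.append(((dx - gx) ** 2 + (dy - gy) ** 2) ** 0.5)
    return res
```

When the flow exactly matches the camera's global translation, the residual motion, and hence the temporal saliency, is zero everywhere.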