Computational models of human visual attention driven by auditory cues


TRANSCRIPT

Page 1: Computational models of human visual attention driven by auditory cues


Computational models of human visual attention driven by auditory cues

Akisato Kimura, Ph.D.

NTT Communication Science Laboratories

(Most of the content presented in this talk is based on collaborative research with the National Institute of Informatics, Japan.)

Page 2: Computational models of human visual attention driven by auditory cues


Visual attention

Visual attention is a built-in mechanism of the human visual system for scene understanding.

http://www.tobii.com/eye-tracking-research/global/library/white-papers/tobii-eye-tracking-white-paper/

Page 3: Computational models of human visual attention driven by auditory cues


Simulating visual attention is essential

Such a pre-selection mechanism would be essential in enabling computers to undertake tasks such as:

• HCI [http://www.icub.org]

• Visual assistance [https://www.google.com/glass]

• Object detection [Donoser et al. 09]

Page 4: Computational models of human visual attention driven by auditory cues


Saliency as a measure of attention

Saliency = attractiveness of visual attention

• Simple, easy to implement, reasonable outputs

(Figure: an input image and its saliency map [Itti et al. 98], used for estimating the human visual focus of attention; color scale from low to high saliency.)
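
To give a feel for the mechanism, below is a toy center-surround saliency computation in the spirit of [Itti et al. 98]. It is a minimal sketch for illustration only: a single intensity channel compared at two Gaussian scales, instead of the full color/orientation pyramid of the original model.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simple_saliency(image_gray):
    """Toy center-surround saliency map (illustration only)."""
    img = image_gray.astype(float)
    center = gaussian_filter(img, sigma=2)     # fine scale ("center")
    surround = gaussian_filter(img, sigma=8)   # coarse scale ("surround")
    sal = np.abs(center - surround)            # center-surround difference
    rng = sal.max() - sal.min()
    return (sal - sal.min()) / (rng + 1e-12)   # normalize to [0, 1]
```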

Page 5: Computational models of human visual attention driven by auditory cues


Related work

Visual saliency

• Saliency map model [Itti 1998]

• Shannon self-information [Bruce 2005]

• Incorporating temporal dynamics [Itti 2009]

(Figure: an input image and the saliency maps of [Itti et al. 98] and [Bruce et al. 05].)

Page 6: Computational models of human visual attention driven by auditory cues


Visual attention modulated by audio

Sounds are strongly related to events that draw human visual attention.

(Figure: gaze behavior on a speaking scene without audio vs. with audio [Song et al. 11].)

Page 7: Computational models of human visual attention driven by auditory cues


Related work

Visual saliency

• Saliency map model [Itti 1998]

• Shannon self-information [Bruce 2005]

• Incorporating temporal dynamics [Itti 2009]

Auditory saliency

• Center-surround mechanism [Kayser 2005]

• Bayesian surprise [Schauerte 2013]

Audio-visual saliency

• Multi-modal saliency for robotics []

• Sound source localization [Nakajima 2013]

(Figures: an input image [Itti et al. 98, Bruce et al. 05], an audio spectrogram [Kayser et al. 05], and an input video [Itti et al. 03, Nakajima et al. 13].)

Research on human visual attention models that exploit auditory information is still underway.

Page 8: Computational models of human visual attention driven by auditory cues


Main content of this talk

Our recent challenges in simulating human visual attention driven by auditory cues

• Auditory information plays a supportive role, in contrast to standard multi-modal fusion approaches

• Our strategy is built on two psychophysical findings

1. Audio-visual temporal alignment yields benefits when changes are both synchronized and transient [Van der Burg, PLoS ONE 2010]

2. Auditory attention modulates visual attention in a feature-specific manner [Ahveninen, PLoS ONE 2012]

Page 9: Computational models of human visual attention driven by auditory cues


Our strategy

1. Audio-visual temporal alignment yields benefits when changes are both synchronized and transient [Van der Burg, PLoS ONE 2010]

2. Auditory attention modulates visual attention in a feature-specific manner [Ahveninen, PLoS ONE 2012]

Following those findings…

1. Detect transient events in the visual and auditory domains separately.

2. Look for visual features synchronized with the detected auditory events.

3. Modulate saliency maps by feature selection.

Page 10: Computational models of human visual attention driven by auditory cues


Previous method – Bayesian surprise

Pipeline: input video → image signal → Bayesian surprise → visual saliency (computed from visual features only) → conventional saliency maps

Page 11: Computational models of human visual attention driven by auditory cues


Our strategy

Pipeline:

• Input video → image signal → Bayesian surprise → visual saliency (from selected visual features)

• Input video → audio signal → auditory surprise

• Selecting visual features synchronized with the auditory events

• Modulating saliency maps with the selected features → proposed saliency maps

Page 12: Computational models of human visual attention driven by auditory cues


Bayesian surprise

Pipeline (this step): input video → image signal → visual Bayesian surprise; input video → audio signal → auditory surprise

Page 13: Computational models of human visual attention driven by auditory cues


Concept of Bayesian surprise

• Continuously similar features → low saliency values

• Unexpected features → high saliency values
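
Formally, Bayesian surprise measures how much an observation D shifts the belief over models M: it is the KL divergence between the posterior and the prior. A standard statement consistent with these slides:

```latex
S(D) = \mathrm{KL}\bigl(P(M \mid D) \,\|\, P(M)\bigr)
     = \int P(M \mid D)\, \log \frac{P(M \mid D)}{P(M)} \, dM
```

Small S(D) means the observation barely changes the belief (continuously similar features); large S(D) means the belief shifts substantially (unexpected features).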

Page 14: Computational models of human visual attention driven by auditory cues


Visual Bayesian surprise

Visual features: intensity (×6), color (×12), orientation (×24), flicker (×6), motion (×24), i.e., 72 visual feature maps extracted from the input video over Gaussian pyramid scales.

Each feature map serves as an observation that updates a prior into a posterior via Bayes' rule; the surprise is the Kullback-Leibler divergence between the posterior and the prior, computed for all 72 feature maps.

(Figure: input video and the resulting visual surprise map; color scale from low to high.)

[Itti, Vision Research 2009]
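
A minimal per-pixel sketch of this update, assuming a Gaussian observation model per pixel (the original model uses Poisson/Gamma units, but the KL(posterior || prior) structure is the same); the function name and the decay parameter are illustrative:

```python
import numpy as np

def visual_surprise(feature_maps, prior_mean, prior_var, obs_var=1.0, decay=0.7):
    """One Bayesian-surprise update over a stack of visual feature maps.

    feature_maps : (F, H, W) array, one map per visual feature.
    prior_mean, prior_var : (F, H, W) arrays of current prior parameters.
    Returns (surprise, next_mean, next_var) for the next frame.
    """
    # Conjugate Gaussian update: precisions add, means are precision-weighted.
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + feature_maps / obs_var)

    # Surprise = KL(posterior || prior) between the two Gaussians, per pixel.
    surprise = 0.5 * (np.log(prior_var / post_var)
                      + (post_var + (post_mean - prior_mean) ** 2) / prior_var
                      - 1.0)

    # Relax the posterior toward the prior so old evidence fades
    # (decay is an illustrative forgetting factor).
    next_mean = decay * post_mean + (1 - decay) * prior_mean
    next_var = decay * post_var + (1 - decay) * prior_var
    return surprise, next_mean, next_var
```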

Page 15: Computational models of human visual attention driven by auditory cues


Auditory Bayesian surprise

Spectrograms serve as observations: the audio signal is converted into a spectrogram, and at each frequency ω the observation updates a prior into a posterior, giving a surprise at frequency ω; the per-frequency surprises are then averaged over frequencies to yield the auditory surprise.

(Figure: audio signal, its spectrogram, and the resulting auditory surprise over time; scale from low to high.)

[Schauerte, ICASSP2013]
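
A minimal one-pass sketch under the same Gaussian assumption, with one independent belief per frequency bin and the per-bin surprise averaged over frequencies; the STFT parameters and the drift term are illustrative assumptions, not the settings of [Schauerte, ICASSP 2013]:

```python
import numpy as np
from scipy.signal import stft

def auditory_surprise(audio, fs, obs_var=1.0, drift=0.1):
    """Per-frame auditory surprise, averaged over frequency bins."""
    _, _, Z = stft(audio, fs=fs, nperseg=512)
    spec = np.log1p(np.abs(Z))          # (freq_bins, frames)
    mean = spec[:, 0].copy()            # initialize the prior from frame 0
    var = np.ones_like(mean)
    surprises = []
    for frame in spec.T[1:]:
        # Gaussian Bayes update per frequency bin.
        post_var = 1.0 / (1.0 / var + 1.0 / obs_var)
        post_mean = post_var * (mean / var + frame / obs_var)
        # KL(posterior || prior) per bin, then averaged over frequencies.
        kl = 0.5 * (np.log(var / post_var)
                    + (post_var + (post_mean - mean) ** 2) / var - 1.0)
        surprises.append(kl.mean())
        # Additive variance drift keeps the model adaptive over time.
        mean, var = post_mean, post_var + drift
    return np.asarray(surprises)
```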

Page 16: Computational models of human visual attention driven by auditory cues


Audio-visual synchronization

Pipeline (this step): the visual Bayesian surprise and the auditory surprise feed a module that selects the visual features synchronized with the audio, producing visual saliency from the selected visual features.

Page 17: Computational models of human visual attention driven by auditory cues


Correlation-based detection

For each of the 360 visual features f, the visual surprise is averaged over pixels to obtain a time series. An auditory event is detected when the auditory surprise exceeds a threshold θs, and the correlation between each feature's time series and the auditory surprise is calculated within a temporal window around the event; the window width depends on the length of the auditory event. (A sketch follows below.)
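
A minimal sketch of this detection step; θs follows the slide, while the correlation threshold and the per-feature counting are illustrative assumptions (the counts feed the voting on the feature-selection slide):

```python
import numpy as np

def count_synchronizations(vis_surprise, aud_surprise, theta_s, corr_thresh=0.5):
    """Count, per visual feature, how often it synchronizes with audio.

    vis_surprise : (F, T) per-feature visual surprise, averaged over
                   pixels (F = 360 features, T frames).
    aud_surprise : (T,) auditory surprise.
    """
    events = aud_surprise > theta_s              # auditory events via theta_s
    counts = np.zeros(vis_surprise.shape[0], dtype=int)
    t, T = 0, len(aud_surprise)
    while t < T:
        if not events[t]:
            t += 1
            continue
        start = t                                # window spans the whole event
        while t < T and events[t]:
            t += 1
        a = aud_surprise[start:t]
        if a.std() == 0:                         # too short/flat to correlate
            continue
        for f in range(vis_surprise.shape[0]):
            v = vis_surprise[f, start:t]
            if v.std() > 0 and np.corrcoef(v, a)[0, 1] > corr_thresh:
                counts[f] += 1                   # feature f synchronized here
    return counts
```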

Page 18: Computational models of human visual attention driven by auditory cues


Visual feature selection

Pipeline (this step): the visual features selected as synchronized with the audio modulate the saliency maps, and the modulated maps form the proposed saliency maps (visual saliency from selected visual features).

Page 19: Computational models of human visual attention driven by auditory cues


Selecting visual features

The synchronization results are binarized per feature type over time, and the frequency of "synchronization" is counted for each of the 360 feature types. Features are then selected by voting with a threshold θc, leaving N < 360 feature types. The final saliency map emphasizes the selected features by summing up only their maps. (A sketch follows below.)
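
A minimal sketch of the voting and summation, continuing from the synchronization counts above; the fallback for an empty selection is an illustrative assumption:

```python
import numpy as np

def final_saliency(surprise_maps, sync_counts, theta_c):
    """Sum only the feature maps that won the synchronization vote.

    surprise_maps : (F, H, W) per-feature visual surprise maps.
    sync_counts   : (F,) synchronization counts per feature.
    theta_c       : voting threshold on the counts.
    """
    selected = sync_counts >= theta_c        # voting with threshold theta_c
    if not selected.any():                   # fallback: keep all features
        selected[:] = True
    return surprise_maps[selected].sum(axis=0)
```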

Page 20: Computational models of human visual attention driven by auditory cues


Experimental setup

Recording scan-paths as ground truth

• 15 subjects

• 6 videos (The DIEM project)

• Using a Tobii TX300 eye tracker

Evaluation criteria

• Normalized Scanpath Saliency (NSS) [Peters 2009]

• Baselines: saliency map model [Itti 2003], Bayesian surprise [Itti 2009], sound source localization [Nakajima 2013]
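
For reference, NSS normalizes the saliency map S to zero mean and unit standard deviation and averages it over the N human fixation locations x_i, so values well above 0 indicate that fixations land on salient regions:

```latex
\mathrm{NSS} = \frac{1}{N} \sum_{i=1}^{N} \frac{S(x_i) - \mu_S}{\sigma_S}
```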

Page 21: Computational models of human visual attention driven by auditory cues


Experimental results – summary

The proposed model produced the best NSS scores for all the videos.

Page 22: Computational models of human visual attention driven by auditory cues


Qualitative evaluation – Video 2

Page 23: Computational models of human visual attention driven by auditory cues


Qualitative evaluation – Video 2

(Figure: input frame, baseline saliency, auditory surprise, and the proposed saliency for Video 2.)

Page 24: Computational models of human visual attention driven by auditory cues


Detailed evaluation – Video 1

(Figure: NSS of the proposed model vs. the baseline over frames, together with the auditory surprise and the detected auditory events.)

Selected visual features:

Feature     Intensity  Color  Orientation  Flicker  Motion  Total
Baseline           30     60          120       30     120    360
Proposed            8     17           46        0       0     71

The proposed model outperformed the baseline in many frames

Page 25: Computational models of human visual attention driven by auditory cues


Some extensions

Drawbacks of the proposed method

• Two-pass algorithm: the whole video must be scanned first to detect synchronization.

Recent updates

• Sequential estimation of visual & auditory surprise via exponential smoothing (see the sketch below)
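
One way such a one-pass variant could look, as a minimal sketch: exponentially smoothed running statistics of the visual and auditory surprise yield a running per-feature correlation, so synchronization can be judged frame by frame. The class name, alpha, and the running-correlation formulation are illustrative assumptions, not the published algorithm.

```python
import numpy as np

class SequentialSync:
    """Online synchronization detection via exponential smoothing."""

    def __init__(self, n_features, alpha=0.1):
        self.alpha = alpha
        self.mv = np.zeros(n_features)   # smoothed visual surprise per feature
        self.ma = 0.0                    # smoothed auditory surprise
        self.cov = np.zeros(n_features)  # smoothed cross term
        self.vv = np.ones(n_features)    # smoothed visual variance
        self.va = 1.0                    # smoothed auditory variance

    def update(self, vis_surprise, aud_surprise):
        a = self.alpha
        dv = vis_surprise - self.mv
        da = aud_surprise - self.ma
        self.mv += a * dv
        self.ma += a * da
        self.cov = (1 - a) * self.cov + a * dv * da
        self.vv = (1 - a) * self.vv + a * dv ** 2
        self.va = (1 - a) * self.va + a * da ** 2
        # Running correlation per feature; high values mean "synchronized".
        return self.cov / np.sqrt(self.vv * self.va + 1e-12)
```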

NSS scores per video (higher is better):

                 Video 1  Video 2  Video 3  Video 4  Video 5  Video 6
Itti 2009          2.896    1.816    0.790    1.209    0.318    0.513
Nakajima 2013      1.857    0.992    0.540    1.073    0.368    0.216
Proposed (new)     3.077    1.820    0.791    1.273    0.318    0.513

Page 26: Computational models of human visual attention driven by auditory cues


Conclusion

Our recent challenges in simulating human visual attention driven by auditory cues

• Auditory information plays a supportive role

• Our model is built on recent psychophysical findings

Research on human visual attention models that exploit auditory information is still underway; future directions include:

• Auditory attention models

• Auditory cues other than synchronization

Page 27: Computational models of human visual attention driven by auditory cues


References

• Kimura, Yonetani, Hirayama “Computational models of human visual attention and their implementations: A survey,” IEICE Transactions on Information and Systems, Vol.E96-D, No.3, 2013.

• Nakajima, Sugimoto, Kawamoto "Incorporating audio signals into constructing a visual saliency map," Proc. Pacific-Rim Symposium on Image and Video Technology (PSIVT 2013).

• Nakajima, Kimura, Sugimoto, Kashino "Visual attention driven by auditory cues: Selecting visual features in synchronization with attracting auditory events," Proc. International Conference on Multimedia Modeling (MMM 2015).

• Nakajima, Kimura, Sugimoto, Kashino “An online computational model of human visual attention considering spatio-temporal synchronization with auditory events,” IPSJ Technical Report, CVIM195-57, 2015 (in Japanese).