Computer Vision, Speech Communication & Signal Processing Group, National Technical University of Athens, Greece (NTUA)
Robotic Perception and Interaction Unit,
Athena Research and Innovation Center (Athena RIC)
Part 2: Visual Processing and Saliency
Petros Koutras
Tutorial at IEEE International Conference on Acoustics, Speech and Signal Processing 2017, New Orleans, USA, March 5, 2017
Visual Processing and Saliency
[Overview diagram: spatio-temporal processing feeds visual saliency models, which support eye fixation prediction and framewise saliency]
Part 2: Outline
Visual Saliency and Attention
State-of-the-Art in Visual Saliency
Spatio-Temporal Framework for Visual Saliency
Applications: Eye Fixation Prediction, Framewise Saliency
Visual Saliency and Attention
Visual Attention: top-down, task-driven, high-level topics.
Visual Saliency: bottom-up, data-driven, low-level sensory cues.
Applications: systems for selecting the most important regions of a large amount of visual data; movie summarization; visual frontend for other applications.
Visual Saliency: Approaches, Measurements
Predict viewers' fixations in both space and time: eye-tracking data from different users (CRCNS, Eye-Tracking Movie Database ETMD).
Detect salient objects: hand-annotated databases.
Framewise saliency: find the frames that are more salient than the others (COGNIMUSE annotated database).
State-of-the-Art in Visual Saliency
Feature Integration Theory (FIT)
[A. Treisman and G. Gelade, “A feature integration theory of attention”, Cognit. Psychol., 1980.]
[A. Treisman and S. Sato, “Conjunction search revisited”, J. of Experimental Psychology: Human Perception and Performance, 1990.]
We can detect and identify separable features in parallel across a display; this early, parallel process of feature registration mediates texture segregation and figure-ground grouping.
Locating any individual feature requires an additional operation.
Conjunctions require focal attention to be directed serially to each relevant location; they do not mediate texture segregation, and they cannot be identified without also being spatially localized.
Saliency Map Concept
[C. Koch and S. Ullman, “Shifts in selective visual attention: towards the underlying neural circuitry”, Human Neurobiol., 1985.]
Saliency Map
[Figure: original image and its estimated saliency map]
Spatial Saliency Benchmarks: http://saliency.mit.edu/index.html
First Computational Model (Itti et al. 1998)
[L. Itti, C. Koch and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis”, IEEE Trans. PAMI, 1998.]
Spatio-temporal Extension (Itti et al.)
[L. Itti, N. Dhavale and F. Pighin, “Realistic avatar eye and head animation using a neurobiological model of visual attention”, SPIE 48th Annual International Symposium on Optical Science and Technology, 2003.]
5 feature maps: 3 static features (intensity, color, orientation) and 2 spatio-temporal features (flicker, motion).
Graph-Based Visual Saliency (GBVS)
Markovian approach: the dissimilarity between pixels (i,j) and (p,q) of the feature map M is d((i,j)||(p,q)) = |log(M(i,j)/M(p,q))|, which (modulated by a Gaussian falloff in distance) gives the weight for the edge from node (i,j) to node (p,q).
Define a Markov chain on the graph: normalize the weights of the outbound edges, so that nodes become states and weights become transition probabilities; then find the equilibrium distribution of this chain (see the sketch below).
Stages for visual saliency extraction: Extraction: extract feature vectors (intensity, color, orientation). Activation: form the activation maps from the feature vectors. Normalization/Combination: normalize the activation maps and combine them into a single map.
[J. Harel, C. Koch and P. Perona, “Graph-based visual saliency”, NIPS 2006.]
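A minimal Python sketch of the GBVS activation step, assuming a single small feature map (the paper combines several features and scales); the function name gbvs_activation and the sigma value are illustrative, not from the original code.

import numpy as np

def gbvs_activation(M, sigma=0.15, iters=100):
    # Activation map as the equilibrium distribution of a Markov chain whose
    # edge weights combine feature dissimilarity with spatial proximity.
    h, w = M.shape
    n = h * w
    vals = M.ravel().astype(float) + 1e-8          # avoid log(0)
    ys, xs = np.mgrid[0:h, 0:w]
    ys, xs = ys.ravel(), xs.ravel()

    # Dissimilarity d((i,j)||(p,q)) = |log(M(i,j)/M(p,q))|
    d = np.abs(np.log(vals)[:, None] - np.log(vals)[None, :])
    dist2 = (ys[:, None] - ys[None, :])**2 + (xs[:, None] - xs[None, :])**2
    W = d * np.exp(-dist2 / (2.0 * (sigma * w)**2))

    # Normalize outbound edge weights: nodes -> states,
    # weights -> transition probabilities
    P = W / (W.sum(axis=1, keepdims=True) + 1e-12)

    # Equilibrium distribution by power iteration
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = pi @ P
    return pi.reshape(h, w)

Since the transition matrix is dense (n x n), this is only practical on coarse maps (e.g. 32x24), which is also how GBVS is typically run.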
Adaptive Whitening Saliency (AWS)
Chromatic decomposition, Log-Gabor filters, oriented multiscale decomposition and whitening.
[A. Garcia-Diaz, X.R. Fernandez-Vidal, X.M. Pardo and R. Dosil, “Saliency from hierarchical adaptation through decorrelation and variance normalization”, Image Vis. Comput., 2012.]
Scene Context (GIST)
From very brief exposure to a scene, we can already extract a lot of information about its global structure, its category and some of its components.
[A. Torralba, A. Oliva, M. Castelhano and J. M. Henderson, “Contextual Guidance of Attention in Natural scenes: The role of Global features on object search”, Psychological Review, 2006.]
Saliency Using Natural Scene Statistics (SUN)
Static model of natural image statistics: bottom-up saliency is the self-information −log p(F) of local features, which lends itself to a very fast computational framework.
Spatio-temporal extension: SUNDAy, dynamic analysis of scenes.
[L. Zhang, M.H. Tong, T.K. Marks, H. Shan and G.W. Cottrell, “SUN: a Bayesian framework for saliency using natural statistics”, J. Vis., 2008.]
[L. Zhang, M.H. Tong and G.W. Cottrell, “SUNDAy: Saliency using natural statistics for dynamic analysis of scenes”, 31st Annual Cognitive Science Conference, 2009.]
Bayesian and Surprise Models
[L. Itti and P. Baldi, “Bayesian surprise attracts human attention”, NIPS 2005.]
Spatial Bayesian Surprise
Spatial surprise from an image region to account for saliency due to contrast with context: the prior is the feature distribution in the spatial context (surroundings); the posterior is the distribution after observing the region of interest.
Extension of the visual attention model through the use of surprise values instead of raw feature maps.
[I. Gkioulekas, G. Evangelopoulos and P. Maragos, “Spatial Bayesian Surprise for Image Saliency and Quality Assessment”, ICIP 2010.]
AIM (Attention by Information Maximization)
Independent (sparse) coding
Want to quantify likelihood of observing local patch/region of image
Likelihood related to self-information via –log(p(x))
[N. Bruce and J. Tsotsos, “Saliency based on information maximization”, NIPS 2005.]
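A simplified sketch of the AIM idea in Python, under loud assumptions: PCA components stand in for the ICA-learned sparse basis of the original model, the coefficient densities are plain histograms, and the input is a single grayscale image; all names are illustrative.

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def aim_saliency(img, patch=7, n_comp=25, bins=100):
    # Collect all local patches as rows of a data matrix
    X = sliding_window_view(img, (patch, patch)).reshape(-1, patch * patch)
    X = X - X.mean(axis=0)

    # PCA basis (stand-in for AIM's ICA-learned sparse filters)
    _, _, Vt = np.linalg.svd(X[::7], full_matrices=False)  # subsampled for speed
    C = X @ Vt[:n_comp].T                                  # filter responses

    # Self-information: sum over components of -log p(response),
    # with p estimated by a histogram over the whole image
    info = np.zeros(C.shape[0])
    for k in range(n_comp):
        hist, edges = np.histogram(C[:, k], bins=bins, density=True)
        idx = np.clip(np.digitize(C[:, k], edges) - 1, 0, bins - 1)
        info -= np.log(hist[idx] + 1e-8)
    h, w = img.shape
    return info.reshape(h - patch + 1, w - patch + 1)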
Incremental Coding Length
Measure entropy gain of each feature
Maximize entropy across sample features
Select features with large coding length increment
[X. Hou and L. Zhang, “Dynamic visual attention: searching for coding length increments”, NIPS 2009.]
Discriminant / Decision Theoretic Saliency
Derived explicitly from a minimum Bayes error definition; the class variable c applies to centre/surround, but also to other classes (e.g. face vs. null hypothesis).
[D. Gao and N. Vasconcelos, “Discriminant saliency for visual recognition from cluttered scenes”, NIPS 2004.]
[D. Gao, S. Han and N. Vasconcelos, “Discriminant saliency, the detection of suspicious coincidences, and applications to visual recognition”, IEEE Trans. PAMI, 2009.]
Rarity Based Saliency
Considers rarity of features (both local and global, including self-information)
Multi-scale approach reminiscent of Itti et al.
Normalization/Whitening across color inputs and across scale, weighted combination/fusion
[N. Riche, M. Mancas, M. Duvinage, M. Mibulumukini, B. Gosselin and T. Dutoit, “Rare2012: a multi-scale rarity-based saliency detection with its comparative statistical analysis”, Sig. Proc.: Im. Com., 2013.]
Saliency by Self-Resemblance
Local structure represented by a matrix of local descriptors (steering kernels, robust to noise and image distortions).
Matrix cosine similarity forms a metric for the resemblance of a pixel to its surround.
Amounts to an estimate of the likelihood of the local feature matrix given the feature matrices of pixels in the surround.
[H.J. Seo and P. Milanfar, “Static and space-time visual saliency detection by self-resemblance”, J. Vis., 2009.]
Boolean Map Based Saliency
Generate a set of Boolean maps by randomly thresholding the input image's feature maps in the CIE Lab color space (perceptually uniform).
Given a Boolean map B, BMS computes an attention map A(B) based on a Gestalt principle for figure-ground segregation: surrounded regions are more likely to be perceived as figures.
All attention maps are linearly combined into a full-resolution mean attention map (sketched below).
[J. Zhang and S. Sclaroff, “Saliency detection: A boolean map approach”, CVPR 2013.]
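A rough Python sketch of the BMS pipeline under simplifying assumptions: uniform rather than random threshold levels, both the map and its complement are used, and a Gaussian blur stands in for the paper's post-processing; assumes a float CIE-Lab image and SciPy.

import numpy as np
from scipy import ndimage

def bms_saliency(lab_img, n_thresh=12):
    H, W, _ = lab_img.shape
    acc = np.zeros((H, W))
    count = 0
    for c in range(3):                               # L, a, b channels
        ch = lab_img[..., c]
        levels = np.linspace(ch.min(), ch.max(), n_thresh + 2)[1:-1]
        for t in levels:
            for B in (ch > t, ch <= t):              # Boolean map and complement
                labels, _ = ndimage.label(B)
                border = np.unique(np.concatenate(
                    [labels[0], labels[-1], labels[:, 0], labels[:, -1]]))
                # Attention: keep only "surrounded" regions, i.e. connected
                # components of B that do not touch the image border
                acc += B & ~np.isin(labels, border)
                count += 1
    return ndimage.gaussian_filter(acc / count, sigma=3)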
Spectral Saliency Estimation
Phase-only Fourier Transform (PFT): all you need is the phase!
Phase spectrum of Quaternion Fourier Transform (PQFT): computes the grayscale image, color-opponent images, and frame-difference image in one quaternion transform.
[X. Hou and L. Zhang, “Saliency detection: a spectral residual approach”, CVPR 2007.]
[C. Guo, Q. Ma and L. Zhang, “Spatio-temporal saliency detection using phase spectrum of quaternion fourier transform”, CVPR 2008.]
[C. Guo and L. Zhang, “A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression”, IEEE Trans. Image Process., 2010.]
More on Spectral Saliency
No scale parameter in spectral saliency? Scale is the size: [32x24], [64x48], [128x96] are reasonable choices.
PQFT [Guo et al., CVPR 2008]: compute the frame difference as the “motion channel” and apply spectral saliency (separately or using the quaternion).
Spectral saliency in the real domain: Image Signature (SIG) [Hou et al., PAMI 2012]: ImageSignature = sign(dct2(img));
QDCT [Schauerte et al., ECCV 2012]: extending the Image Signature to the Quaternion DCT.
[Example saliency maps shown at 64x48 and 681x511 resolutions]
PFT and the Image Signature are sketched below.
[C. Guo, Q. Ma and L. Zhang, “Spatio-temporal saliency detection using phase spectrum of quaternion fourier transform”, CVPR 2008.]
[X. Hou, J. Harel and C. Koch, “Image signature: highlighting sparse salient regions”, IEEE Trans. PAMI, 2012.]
[B. Schauerte and R. Stiefelhagen, “Quaternion-based spectral saliency detection for eye fixation prediction”, ECCV 2012.]
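Hedged Python sketches of PFT and the Image Signature, assuming OpenCV and SciPy are available; the working resolution, blur and kernel sizes are illustrative choices, not the papers' exact settings.

import numpy as np
import cv2                        # assumed available for resizing/blurring
from scipy.fftpack import dct, idct

def pft_saliency(gray, size=(64, 48)):
    # Keep only the phase spectrum, invert, square, and blur
    small = cv2.resize(gray.astype(np.float32), size)
    F = np.fft.fft2(small)
    recon = np.fft.ifft2(np.exp(1j * np.angle(F)))
    return cv2.GaussianBlur(np.abs(recon) ** 2, (9, 9), 2.5)

def image_signature_saliency(gray, size=(64, 48)):
    # The slide's one-liner, sign(dct2(img)), in the real (DCT) domain
    small = cv2.resize(gray.astype(np.float32), size)
    dct2d = dct(dct(small.T, norm='ortho').T, norm='ortho')
    recon = idct(idct(np.sign(dct2d).T, norm='ortho').T, norm='ortho')
    return cv2.GaussianBlur(recon ** 2, (9, 9), 2.5)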
Machine Learning Techniques
Still images [Judd et al., CVPR 2009]: features: low level (luminance, orientation, color), mid level (vanishing point, horizon line), high level (face detection, object detection); linear Support Vector Machine, tested on single features and on all features.
Ensemble of Deep Networks (eDN) [Vig et al., CVPR 2014]: features from 1-3 layer networks; SVM-based training on fixated and non-fixated regions.
Video saliency [Rudoy et al., CVPR 2013]: candidate extraction: static (GBVS), motion (optical flow, DoG), semantic (face and body estimation); modeling gaze dynamics: gaze transitions for training, learning transition probabilities.
[T. Judd, K. Ehinger, F. Durand and A. Torralba, “Learning to predict where humans look”, CVPR 2009.]
[E. Vig, M. Dorr and D. Cox, “Large-scale optimization of hierarchical features for saliency prediction in natural images”, CVPR 2014.]
[D. Rudoy, D.B. Goldman, E. Shechtman and L. Zelnik-Manor, “Learning video saliency from human gaze using candidate selection”, CVPR 2013.]
Task-specific Learning Techniques
Based on bottom-up saliency and gist descriptors.
Employed for task-specific or multi-task eye-tracking prediction on spatio-temporal stimuli.
[A. Borji, D.N. Sihite and L. Itti, “Probabilistic learning of task-specific visual attention”, CVPR 2012.]
[J. Li, Y. Tian, T. Huang and W. Gao, “Probabilistic multi-task learning for visual saliency estimation in video”, Int. J. Comp. Vis., 2010.]
CNN-based Saliency Models
Adaptation of CNN models from the visual recognition task.
Linear combination of different layers and Gaussian blurring.
Multiscale information.
Objective functions that optimize common saliency evaluation metrics.
[M. Kümmerer, L. Theis and M. Bethge, “Deep Gaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet”, ICLR Workshop 2015.]
[X. Huang, C. Shen, X. Boix and Q. Zhao, “SALICON: Reducing the Semantic Gap in Saliency Prediction by Adapting Deep Neural Networks”, ICCV 2015.]
Patch-based CNN Saliency Model
Extract fixation and non-fixation image regions to train an end-to-end binary multiresolution CNN.
At test time, composite maps from small image regions to construct the final saliency map.
[N. Liu, J. Han, D. Zhang, S. Wen and T. Liu, “Predicting eye fixations using convolutional neural networks”, CVPR 2015.]
Loss Functions for End-to-End Saliency Mapping
Saliency is a dense prediction problem: standard loss functions for regression, or losses based on probability distance measures.
[S. Jetley, N. Murray and E. Vig, “End-to-end saliency mapping via probability distribution prediction”, CVPR 2016.]
Fixation Prediction Evaluation Datasets
Spatial (still images): MIT, Bruce and Tsotsos (Toronto), Kootstra, CAT2000, SALICON, …
Spatio-temporal (videos): CRCNS, DIEM, Actions in the Eye, Eye-Tracking Movie Database (ETMD), …
[A. Borji and L. Itti, “State-of-the-art in visual attention modeling”, IEEE Trans. PAMI, 2013.]
Evaluation Results (Static Databases)
[A. Borji, D.N. Sihite and L. Itti, “Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study”, IEEE Trans. Image Process., 2013.]
Evaluation Results (Video Databases)
[A. Borji, D.N. Sihite and L. Itti, “Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study”, IEEE Trans. Image Process., 2013.]
Salient Object Detection
Labeled regions rather than fixation points: Salient Objects Dataset (SOD), Extended Complex Scene Saliency Dataset (ECSSD).
These two kinds of evaluation can disagree with each other.
[A. Borji, M.M. Cheng, H. Jiang and J. Li, “Salient object detection: A benchmark”, IEEE Trans. Image Process., 2015.]
Spatio-Temporal Framework for Visual Saliency
[P. Koutras and P. Maragos, “A Perceptually-based Spatio-Temporal Computational Framework for Visual Saliency Estimation”, Sig. Proc.: Im. Com., 2015.]
Why Spatio-Temporal Saliency?
[Comparison of saliency maps: AWS (static) vs. spatio-temporal energy]
Visual Saliency Representations
[Diagram: spatio-temporal processing for visual saliency produces energies/STIPs, saliency maps and visual curves, feeding movie summarization, action recognition and eye fixation prediction]
Spatio-Temporal Frontend for Visual Saliency
Related to the cognition-inspired saliency methods, based on the Koch & Ullman theory.
Uses biologically plausible spatio-temporal filters, such as oriented 3D Gabor filters, to extract visual features.
Detects both the fastest changes in the video stimuli (e.g. flicker) and the slower motion changes related to action events.
Spatio-Temporal Frontend for Visual Saliency: Overview
Color modeling: CIE-Lab or a PCA-projected color space, split into a luminance stream and a color stream.
[P. Koutras and P. Maragos, “A Perceptually-based Spatio-Temporal Computational Framework for Visual Saliency Estimation”, Sig. Proc.: Im. Com., 2015.]
Spatio-Temporal Dominant Analysis (STDA)
Extract 3 dominant energy volumes for each stream, expressing basic perceptual concepts in visual saliency:
Spatio-temporal: related to motion.
Static (or spatial): related to frame texture or edges.
Low-pass: related to what other models call “intensity” (in either the luminance or the color stream).
Spatio-Temporal Gabor Filterbank
Spatial Gabor Filterbank
5 positive & 5 negative temporal frequencies.
Full spatial filterbank (40 filters): 5 scales, 8 orientations.
Reduced spatial filterbank (12 filters): 3 scales, 8 orientations.
Separable 3D Gabor Filters
Quadrature Pairs of Separable 3D Gabor Filters
[K. Maninis, P. Koutras and P. Maragos, “Advances on Action Recognition in Videos Using an Interest Point Detector Based on Multiband Spatio-Temporal Energies”, ICIP 2014.]
Postprocessing
Quadrature-pair square energy for each Gabor filter.
Dominant energy selection.
Center-surround difference for the low-pass energy.
Temporal Moving Average (TMA) for all energy types; a sketch of the quadrature energy and TMA steps follows.
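A Python sketch of the quadrature-pair energy and TMA steps along the temporal axis (the separable 3D filters factor into such 1D stages along x, y and t); the frequency, bandwidth and window parameters are illustrative.

import numpy as np
from scipy.ndimage import convolve1d, uniform_filter1d

def gabor_pair_1d(f0, sigma):
    # Quadrature pair: even (cosine) and odd (sine) phase Gabor kernels
    half = int(3 * sigma)
    t = np.arange(-half, half + 1)
    g = np.exp(-t**2 / (2 * sigma**2))
    return g * np.cos(2 * np.pi * f0 * t), g * np.sin(2 * np.pi * f0 * t)

def temporal_gabor_energy(video, f0=0.25, sigma=2.0, tma=5):
    # Phase-insensitive energy along the time axis of a (T, H, W) volume:
    # sum of squared even/odd responses, then Temporal Moving Average (TMA)
    g_even, g_odd = gabor_pair_1d(f0, sigma)
    e = convolve1d(video, g_even, axis=0)**2 \
      + convolve1d(video, g_odd, axis=0)**2
    return uniform_filter1d(e, size=tma, axis=0)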
Visual Saliency in Movie Videos - Demo
[Demo panels: original RGB frames, luminance STDE, color contrast, low-pass energy]
COGNIMUSE Database: Lord of the Rings: The Return of the King
Eye Fixation Prediction
[P. Koutras and P. Maragos, “A Perceptually-based Spatio-Temporal Computational Framework for Visual Saliency Estimation”, Sig. Proc.: Im. Com., 2015.]
Visual Saliency Demo
Original RGB Frames with Eye Tracking
Luminance Spatio-Temporal Dominant Energy
Eye-Tracking Movie Database - Examples
[P. Koutras, A. Katsamanis and P. Maragos, “Predicting Eyes' Fixations in Movie Videos: Visual Saliency Experiments on a New Eye-Tracking Database”, HCI 2014.]
Center Bias – ETMD Fixations
Evaluation Measures for Visual Saliency
Correlation Coefficient (CC): build a fixation map by centering a 2D Gaussian at each viewer's eye fixation, then correlate it with the saliency map.
Normalized Scanpath Saliency (NSS): standardize the saliency map (zero mean, unit variance), take its values at each viewer's fixation positions, and average over all viewers' fixations.
Area Under Curve (AUC): area under the Receiver Operating Characteristic (ROC) curve (False Positive Rate vs. Recall) for the binary classification problem (salient / non-salient regions). Minimal sketches of CC and NSS follow.
[A. Borji, D.N. Sihite and L. Itti, “Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study”, IEEE Trans. Image Process., 2013.]
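Minimal Python sketches of the NSS and CC measures, assuming fixations are given as (row, col) pixel coordinates; the Gaussian width sigma is an illustrative choice.

import numpy as np
from scipy.ndimage import gaussian_filter

def nss(sal_map, fixations):
    # Standardize to zero mean / unit variance, then average the values
    # at the viewers' fixation positions
    s = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-8)
    return np.mean([s[r, c] for r, c in fixations])

def cc(sal_map, fixations, sigma=20):
    # Fixation density map: a 2D Gaussian centered at each fixation
    fix_map = np.zeros_like(sal_map, dtype=float)
    for r, c in fixations:
        fix_map[r, c] += 1.0
    fix_map = gaussian_filter(fix_map, sigma)
    return np.corrcoef(sal_map.ravel(), fix_map.ravel())[0, 1]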
Fixation Prediction Results – CRCNS ORIG
Comparison with 15 state-of-the-art models, including 3 spatio-temporal models representing the 3 basic approaches: cognitively inspired, statistical framework, and frequency-domain analysis.
Fixation Prediction Results – ETMD
6 Oscar-winning Hollywood movies: Chicago, Crash, Departed, Finding Nemo, Gladiator, Lord of the Rings.
2 short video clips (3-3.5 min) from each movie; scenes with both high action and dialogue.
Eye-tracking human annotation: eye-tracking data from 10 different people, on both grayscale and color versions of each video; one fixation point per frame.
Framewise Saliency
[P. Koutras, A. Zlatintsi, E. Iosif, A. Katsamanis, P. Maragos and A. Potamianos, “Predicting Audio-Visual Salient Events Based on Visual, Audio and Text Modalities for Movie Summarization”, ICIP 2015.]
Framewise Visual Saliency - Features
Visual features: 3D Gabor energy model on both the luminance and color streams:
Spatio-temporal dominant energies (filterbank of 400 3D Gabor filters).
Spatial dominant energies (filterbank of 40 spatial Gabor filters).
Framewise Visual Saliency – Energy Curves (3D Gabor Energy Model)
Energy curves are a simple 3D-to-1D mapping: the mean value of each 2D frame slice of each 3D energy volume (STDE, SDE), yielding 4 temporal sequences of visual feature values; a minimal sketch follows.
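A minimal sketch of this 3D-to-1D mapping, assuming a (T, H, W) energy volume.

import numpy as np

def energy_curve(energy_volume):
    # One value per frame: the mean of each (H, W) slice of a (T, H, W)
    # energy volume (STDE or SDE, luminance or color stream)
    return energy_volume.mean(axis=(1, 2))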
Framewise Visual Saliency - Summarization
Feature postprocessing: standardize features (zero mean, unit variance); compute 1st and 2nd order derivatives (deltas).
Classification based on KNN or GMM: binary classification problem (salient / non-salient video segments); confidence scores; median filtering of the saliency measurement; sorting the frames by saliency measurement (see the sketch below).
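A sketch of the KNN branch with confidence scores and temporal median filtering, assuming scikit-learn is available and per-frame feature vectors as input; k and the median window are illustrative parameters.

import numpy as np
from scipy.signal import medfilt
from sklearn.neighbors import KNeighborsClassifier   # assumed available

def framewise_saliency_ranking(X_train, y_train, X_test, k=5, win=51):
    # Binary salient / non-salient classifier over per-frame feature vectors
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    conf = knn.predict_proba(X_test)[:, 1]           # confidence scores
    conf = medfilt(conf, kernel_size=win)            # temporal median filtering
    return np.argsort(-conf), conf                   # frames sorted by saliency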
Results on Hollywood Movies - ROC Curves
AUC
TMM’13 (KNN): 0.603
ICIP’15 (KNN): 0.699
ICIP’15 GMM (M=10): 0.660
ICIP’15 GMM (M=10, Viterbi): 0.668
[G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Rapantzikos, G. Skoumas and Y. Avrithis, “Multimodal Saliency and Fusion for Movie Summarization based on Aural, Visual, and Textual Attention”, IEEE Trans. Multimedia, 2013.]
[P. Koutras, A. Zlatintsi, E. Iosif, A. Katsamanis, P. Maragos and A. Potamianos, “Predicting Audio-Visual Salient Events Based on Visual, Audio and Text Modalities for Movie Summarization”, ICIP 2015.]
Part 2: Conclusions
Importance of spatio-temporal video processing: saliency estimation, event detection, summarization.
Visual saliency: deep networks achieve state-of-the-art performance on standard benchmarks, but datasets remain small-scale, center-biased, and biased towards semantic objects.
Spatio-temporal saliency networks: eye fixation prediction, framewise saliency.
Tutorial slides: http://cognimuse.cs.ntua.gr/icassp17