Computer Vision, Speech Communication & Signal Processing Group, National Technical University of Athens, Greece (NTUA)
Robotic Perception and Interaction Unit,
Athena Research and Innovation Center (Athena RIC)
Part 2: Visual Processing and Saliency
Petros Koutras
Tutorial at IEEE International Conference on Acoustics, Speech and Signal Processing 2017, New Orleans, USA, March 5, 2017
Visual Processing and Saliency
[Overview diagram: spatio-temporal processing feeds visual saliency models, which support eye fixation prediction and framewise saliency]
Part 2: Outline
Visual Saliency and Attention
State-of-the-Art in Visual Saliency
Spatio-Temporal Framework for Visual Saliency
Applications: Eye Fixation Prediction, Framewise Saliency
Visual Saliency and Attention
Visual Attention: top-down, task-driven, high-level topics.
Visual Saliency: bottom-up, data-driven, low-level sensory cues.
Applications: systems for selecting the most important regions of a large amount of visual data; movie summarization; visual frontend for other applications.
Visual Saliency: Approaches, Measurements
Predict viewers' fixations in both space and time: eye-tracking data from different users (CRCNS, Eye-Tracking Movie Database ETMD).
Detect salient objects: hand-annotated databases.
Framewise saliency: find the frames that are more salient than the others (COGNIMUSE annotated database).
State-of-the-Art in Visual Saliency
Feature Integration Theory (FIT)
[A. Treisman and G. Gelade, “A feature integration theory of attention”, Cognit. Psychol., 1980.]
[A. Treisman and S. Sato, “Conjunction search revisited”, J. of Experimental Psychology: Human Perception and Performance, 1990.]
We can detect and identify separable features in parallel across a display; this early, parallel process of feature registration mediates texture segregation and figure-ground grouping.
Locating any individual feature requires an additional operation.
Conjunctions require focal attention to be directed serially to each relevant location; they do not mediate texture segregation, and they cannot be identified without also being spatially localized.
Saliency Map Concept
[C. Koch and S. Ullman, “Shifts in selective visual attention: towards the underlying neural circuitry”, Human Neurobiol., 1985.]
Saliency Map
[Figure: original image and its estimated saliency map]
Spatial Saliency Benchmarks: http://saliency.mit.edu/index.html
First Computational Model (Itti et al. 1998)
[L. Itti, C. Koch and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis”, IEEE Trans. PAMI, 1998.]
Spatio-temporal Extension (Itti et al.)
[L. Itti, N. Dhavale and F. Pighin, “Realistic avatar eye and head animation using a neurobiological model of visual attention”, SPIE 48th Annual International Symposium on Optical Science and Technology, 2003.]
5 feature maps: 3 static features (intensity, color, orientation) and 2 spatio-temporal features (flicker, motion).
Graph-Based Visual Saliency (GBVS)
Markovian approach: the dissimilarity between pixels (i,j) and (p,q) of the feature map M is d((i,j)||(p,q)) = |log(M(i,j)/M(p,q))|, which (modulated by a Gaussian falloff in distance) gives the weight for the edge from node (i,j) to node (p,q).
Define a Markov chain on the graph: normalize the weights of the outbound edges, so that nodes become states and weights become transition probabilities; then find the equilibrium distribution of this chain (see the sketch below).
Stages for visual saliency extraction: Extraction: extract feature vectors (intensity, color, orientation). Activation: form the activation maps from the feature vectors. Normalization/Combination: normalize the activation maps and combine them into a single map.
[J. Harel, C. Koch and P. Perona, “Graph-based visual saliency”, NIPS 2006.]
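A minimal Python sketch of the GBVS activation step, assuming a single small feature map (the paper combines several features and scales); the function name gbvs_activation and the sigma value are illustrative, not from the original code.

import numpy as np

def gbvs_activation(M, sigma=0.15, iters=100):
    # Activation map as the equilibrium distribution of a Markov chain whose
    # edge weights combine feature dissimilarity with spatial proximity.
    h, w = M.shape
    n = h * w
    vals = M.ravel().astype(float) + 1e-8          # avoid log(0)
    ys, xs = np.mgrid[0:h, 0:w]
    ys, xs = ys.ravel(), xs.ravel()

    # Dissimilarity d((i,j)||(p,q)) = |log(M(i,j)/M(p,q))|
    d = np.abs(np.log(vals)[:, None] - np.log(vals)[None, :])
    dist2 = (ys[:, None] - ys[None, :])**2 + (xs[:, None] - xs[None, :])**2
    W = d * np.exp(-dist2 / (2.0 * (sigma * w)**2))

    # Normalize outbound edge weights: nodes -> states,
    # weights -> transition probabilities
    P = W / (W.sum(axis=1, keepdims=True) + 1e-12)

    # Equilibrium distribution by power iteration
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = pi @ P
    return pi.reshape(h, w)

Since the transition matrix is dense (n x n), this is only practical on coarse maps (e.g. 32x24), which is also how GBVS is typically run.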
Adaptive Whitening Saliency (AWS)
Chromatic decomposition, Log-Gabor filters, oriented multiscale decomposition and whitening.
[A. Garcia-Diaz, X.R. Fernandez-Vidal, X.M. Pardo and R. Dosil, “Saliency from hierarchical adaptation through decorrelation and variance normalization”, Image Vis. Comput., 2012.]
Scene Context (GIST)
From very brief exposure to a scene, we can already extract a lot of information about its global structure, its category and some of its components.
[A. Torralba, A. Oliva, M. Castelhano and J. M. Henderson, “Contextual Guidance of Attention in Natural scenes: The role of Global features on object search”, Psychological Review, 2006.]
Saliency Using Natural Scene Statistics (SUN)
Static model of natural image statistics: bottom-up saliency is the self-information −log p(F) of local features, which lends itself to a very fast computational framework.
Spatio-temporal extension: SUNDAy, dynamic analysis of scenes.
[L. Zhang, M.H. Tong, T.K. Marks, H. Shan and G.W. Cottrell, “SUN: a Bayesian framework for saliency using natural statistics”, J. Vis., 2008.]
[L. Zhang, M.H. Tong and G.W. Cottrell, “SUNDAy: Saliency using natural statistics for dynamic analysis of scenes”, 31st Annual Cognitive Science Conference, 2009.]
Bayesian and Surprise Models
[L. Itti and P. Baldi, “Bayesian surprise attracts human attention”, NIPS 2005.]
Spatial Bayesian Surprise
Spatial surprise from an image region to account for saliency due to contrast with context: the prior is the feature distribution in the spatial context (surroundings); the posterior is the distribution after observing the region of interest.
Extension of the visual attention model through the use of surprise values instead of raw feature maps.
[I. Gkioulekas, G. Evangelopoulos and P. Maragos, “Spatial Bayesian Surprise for Image Saliency and Quality Assessment”, ICIP 2010.]
AIM (Attention by Information Maximization)
Independent (sparse) coding
Want to quantify likelihood of observing local patch/region of image
Likelihood related to self-information via –log(p(x))
[N. Bruce and J. Tsotsos, “Saliency based on information maximization”, NIPS 2005.]
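A simplified sketch of the AIM idea in Python, under loud assumptions: PCA components stand in for the ICA-learned sparse basis of the original model, the coefficient densities are plain histograms, and the input is a single grayscale image; all names are illustrative.

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def aim_saliency(img, patch=7, n_comp=25, bins=100):
    # Collect all local patches as rows of a data matrix
    X = sliding_window_view(img, (patch, patch)).reshape(-1, patch * patch)
    X = X - X.mean(axis=0)

    # PCA basis (stand-in for AIM's ICA-learned sparse filters)
    _, _, Vt = np.linalg.svd(X[::7], full_matrices=False)  # subsampled for speed
    C = X @ Vt[:n_comp].T                                  # filter responses

    # Self-information: sum over components of -log p(response),
    # with p estimated by a histogram over the whole image
    info = np.zeros(C.shape[0])
    for k in range(n_comp):
        hist, edges = np.histogram(C[:, k], bins=bins, density=True)
        idx = np.clip(np.digitize(C[:, k], edges) - 1, 0, bins - 1)
        info -= np.log(hist[idx] + 1e-8)
    h, w = img.shape
    return info.reshape(h - patch + 1, w - patch + 1)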
Incremental Coding Length
Measure entropy gain of each feature
Maximize entropy across sample features
Select features with large coding length increment
[X. Hou and L. Zhang, “Dynamic visual attention: searching for coding length increments”, NIPS 2009.]
Discriminant / Decision Theoretic Saliency
Derived explicitly from a minimum Bayes error definition; the class variable c applies to centre/surround, but also to other classes (e.g. face vs. null hypothesis).
[D. Gao and N. Vasconcelos, “Discriminant saliency for visual recognition from cluttered scenes”, NIPS 2004.]
[D. Gao, S. Han and N. Vasconcelos, “Discriminant saliency, the detection of suspicious coincidences, and applications to visual recognition”, IEEE Trans. PAMI, 2009.]
Rarity Based Saliency
Considers rarity of features (both local and global, including self-information)
Multi-scale approach reminiscent of Itti et al.
Normalization/Whitening across color inputs and across scale, weighted combination/fusion
[N. Riche, M. Mancas, M. Duvinage, M. Mibulumukini, B. Gosselin and T. Dutoit, “Rare2012: a multi-scale rarity-based saliency detection with its comparative statistical analysis”, Sig. Proc.: Im. Com., 2013.]
Saliency by Self-Resemblance
Local structure represented by a matrix of local descriptors (steering kernels, robust to noise and image distortions).
Matrix cosine similarity forms a metric for the resemblance of a pixel to its surround.
Amounts to an estimate of the likelihood of the local feature matrix given the feature matrices of pixels in the surround.
[H.J. Seo and P. Milanfar, “Static and space-time visual saliency detection by self-resemblance”, J. Vis., 2009.]
Boolean Map Based Saliency
Generate a set of Boolean maps by randomly thresholding the input image's feature maps in the CIE Lab color space (perceptually uniform).
Given a Boolean map B, BMS computes an attention map A(B) based on a Gestalt principle for figure-ground segregation: surrounded regions are more likely to be perceived as figures.
All attention maps are linearly combined into a full-resolution mean attention map (sketched below).
[J. Zhang and S. Sclaroff, “Saliency detection: A boolean map approach”, CVPR 2013.]
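A rough Python sketch of the BMS pipeline under simplifying assumptions: uniform rather than random threshold levels, both the map and its complement are used, and a Gaussian blur stands in for the paper's post-processing; assumes a float CIE-Lab image and SciPy.

import numpy as np
from scipy import ndimage

def bms_saliency(lab_img, n_thresh=12):
    H, W, _ = lab_img.shape
    acc = np.zeros((H, W))
    count = 0
    for c in range(3):                               # L, a, b channels
        ch = lab_img[..., c]
        levels = np.linspace(ch.min(), ch.max(), n_thresh + 2)[1:-1]
        for t in levels:
            for B in (ch > t, ch <= t):              # Boolean map and complement
                labels, _ = ndimage.label(B)
                border = np.unique(np.concatenate(
                    [labels[0], labels[-1], labels[:, 0], labels[:, -1]]))
                # Attention: keep only "surrounded" regions, i.e. connected
                # components of B that do not touch the image border
                acc += B & ~np.isin(labels, border)
                count += 1
    return ndimage.gaussian_filter(acc / count, sigma=3)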
Spectral Saliency Estimation
Phase-only Fourier Transform (PFT): all you need is the phase!
Phase spectrum of Quaternion Fourier Transform (PQFT): computes the grayscale image, color-opponent images, and frame-difference image in one quaternion transform.
[X. Hou and L. Zhang, “Saliency detection: a spectral residual approach”, CVPR 2007.]
[C. Guo, Q. Ma and L. Zhang, “Spatio-temporal saliency detection using phase spectrum of quaternion fourier transform”, CVPR 2008.]
[C. Guo and L. Zhang, “A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression”, IEEE Trans. Image Process., 2010.]
More on Spectral Saliency
No scale parameter in spectral saliency? Scale is the size: [32x24], [64x48], [128x96] are reasonable choices.
PQFT [Guo et al., CVPR 2008]: compute the frame difference as the “motion channel” and apply spectral saliency (separately or using the quaternion).
Spectral saliency in the real domain: Image Signature (SIG) [Hou et al., PAMI 2012]: ImageSignature = sign(dct2(img));
QDCT [Schauerte et al., ECCV 2012]: extending the Image Signature to the Quaternion DCT.
[Example saliency maps shown at 64x48 and 681x511 resolutions]
PFT and the Image Signature are sketched below.
[C. Guo, Q. Ma and L. Zhang, “Spatio-temporal saliency detection using phase spectrum of quaternion fourier transform”, CVPR 2008.]
[X. Hou, J. Harel and C. Koch, “Image signature: highlighting sparse salient regions”, IEEE Trans. PAMI, 2012.]
[B. Schauerte and R. Stiefelhagen, “Quaternion-based spectral saliency detection for eye fixation prediction”, ECCV 2012.]
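Hedged Python sketches of PFT and the Image Signature, assuming OpenCV and SciPy are available; the working resolution, blur and kernel sizes are illustrative choices, not the papers' exact settings.

import numpy as np
import cv2                        # assumed available for resizing/blurring
from scipy.fftpack import dct, idct

def pft_saliency(gray, size=(64, 48)):
    # Keep only the phase spectrum, invert, square, and blur
    small = cv2.resize(gray.astype(np.float32), size)
    F = np.fft.fft2(small)
    recon = np.fft.ifft2(np.exp(1j * np.angle(F)))
    return cv2.GaussianBlur(np.abs(recon) ** 2, (9, 9), 2.5)

def image_signature_saliency(gray, size=(64, 48)):
    # The slide's one-liner, sign(dct2(img)), in the real (DCT) domain
    small = cv2.resize(gray.astype(np.float32), size)
    dct2d = dct(dct(small.T, norm='ortho').T, norm='ortho')
    recon = idct(idct(np.sign(dct2d).T, norm='ortho').T, norm='ortho')
    return cv2.GaussianBlur(recon ** 2, (9, 9), 2.5)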
Machine Learning Techniques
Still images [Judd et al., CVPR 2009]: features: low level (luminance, orientation, color), mid level (vanishing point, horizon line), high level (face detection, object detection); linear Support Vector Machine, tested on single features and on all features.
Ensemble of Deep Networks (eDN) [Vig et al., CVPR 2014]: features from 1-3 layer networks; SVM-based training on fixated and non-fixated regions.
Video saliency [Rudoy et al., CVPR 2013]: candidate extraction: static (GBVS), motion (optical flow, DoG), semantic (face and body estimation); modeling gaze dynamics: gaze transitions for training, learning transition probabilities.
[T. Judd, K. Ehinger, F. Durand and A. Torralba, “Learning to predict where humans look”, CVPR 2009.]
[E. Vig, M. Dorr and D. Cox, “Large-scale optimization of hierarchical features for saliency prediction in natural images”, CVPR 2014.]
[D. Rudoy, D.B. Goldman, E. Shechtman and L. Zelnik-Manor, “Learning video saliency from human gaze using candidate selection”, CVPR 2013.]
Task-specific Learning Techniques
Based on bottom-up saliency and gist descriptors.
Employed for task-specific or multi-task eye-tracking prediction on spatio-temporal stimuli.
[A. Borji, D.N. Sihite and L. Itti, “Probabilistic learning of task-specific visual attention”, CVPR 2012.]
[J. Li, Y. Tian, T. Huang and W. Gao, “Probabilistic multi-task learning for visual saliency estimation in video”, Int. J. Comp. Vis., 2010.]
CNN-based Saliency Models
Adaptation of CNN models from the visual recognition task.
Linear combination of different layers and Gaussian blurring.
Multiscale information.
Objective functions that optimize common saliency evaluation metrics.
[M. Kümmerer, L. Theis and M. Bethge, “Deep Gaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet”, ICLR Workshop 2015.]
[X. Huang, C. Shen, X. Boix and Q. Zhao, “SALICON: Reducing the Semantic Gap in Saliency Prediction by Adapting Deep Neural Networks”, ICCV 2015.]
Patch-based CNN Saliency Model
Extract fixation and non-fixation image regions to train an end-to-end binary multiresolution CNN.
At test time, composite maps from small image regions to construct the final saliency map.
[N. Liu, J. Han, D. Zhang, S. Wen and T. Liu, “Predicting eye fixations using convolutional neural networks”, CVPR 2015.]
Loss Functions for End-to-End Saliency Mapping
Saliency is a dense prediction problem: standard loss functions for regression, or losses based on probability distance measures.
[S. Jetley, N. Murray and E. Vig, “End-to-end saliency mapping via probability distribution prediction”, CVPR 2016.]
Fixation Prediction Evaluation Datasets
Spatial (still images): MIT, Bruce and Tsotsos (Toronto), Kootstra, CAT2000, SALICON, …
Spatio-temporal (videos): CRCNS, DIEM, Actions in the Eye, Eye-Tracking Movie Database (ETMD), …
[A. Borji and L. Itti, “State-of-the-art in visual attention modeling”, IEEE Trans. PAMI, 2013.]
Evaluation Results (Static Databases)
[A. Borji, D.N. Sihite and L. Itti, “Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study”, IEEE Trans. Image Process., 2013.]
Evaluation Results (Video Databases)
[A. Borji, D.N. Sihite and L. Itti, “Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study”, IEEE Trans. Image Process., 2013.]
Salient Object Detection
Labeled regions rather than fixation points: Salient Objects Dataset (SOD), Extended Complex Scene Saliency Dataset (ECSSD).
These two kinds of evaluation can disagree with each other.
[A. Borji, M.M. Cheng, H. Jiang and J. Li, “Salient object detection: A benchmark”, IEEE Trans. Image Process., 2015.]
Spatio-Temporal Framework for Visual Saliency
[P. Koutras and P. Maragos, “A Perceptually-based Spatio-Temporal Computational Framework for Visual Saliency Estimation”, Sig. Proc.: Im. Com., 2015.]
Why Spatio-Temporal Saliency?
[Comparison of saliency maps: AWS (static) vs. spatio-temporal energy]
Visual Saliency Representations
[Diagram: spatio-temporal processing for visual saliency produces energies/STIPs, saliency maps and visual curves, feeding movie summarization, action recognition and eye fixation prediction]
Spatio-Temporal Frontend for Visual Saliency
Related to the cognition-inspired saliency methods, based on the Koch & Ullman theory.
Uses biologically plausible spatio-temporal filters, such as oriented 3D Gabor filters, to extract visual features.
Detects both the fastest changes in the video stimuli (e.g. flicker) and the slower motion changes related to action events.
Spatio-Temporal Frontend for Visual Saliency: Overview
Color modeling: CIE-Lab or a PCA-projected color space, split into a luminance stream and a color stream.
[P. Koutras and P. Maragos, “A Perceptually-based Spatio-Temporal Computational Framework for Visual Saliency Estimation”, Sig. Proc.: Im. Com., 2015.]
Spatio-Temporal Dominant Analysis (STDA)
Extract 3 dominant energy volumes for each stream, expressing basic perceptual concepts in visual saliency:
Spatio-temporal: related to motion.
Static (or spatial): related to frame texture or edges.
Low-pass: related to what other models call “intensity” (in either the luminance or the color stream).
Spatio-Temporal Gabor Filterbank
Spatial Gabor Filterbank
5 positive & 5 negative temporal frequencies.
Full spatial filterbank (40 filters): 5 scales, 8 orientations.
Reduced spatial filterbank (12 filters): 3 scales, 8 orientations.
Separable 3D Gabor Filters
Quadrature Pairs of Separable 3D Gabor Filters
[K. Maninis, P. Koutras and P. Maragos, “Advances on Action Recognition in Videos Using an Interest Point Detector Based on Multiband Spatio-Temporal Energies”, ICIP 2014.]
Postprocessing
Quadrature-pair square energy for each Gabor filter.
Dominant energy selection.
Center-surround difference for the low-pass energy.
Temporal Moving Average (TMA) for all energy types; a sketch of the quadrature energy and TMA steps follows.
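A Python sketch of the quadrature-pair energy and TMA steps along the temporal axis (the separable 3D filters factor into such 1D stages along x, y and t); the frequency, bandwidth and window parameters are illustrative.

import numpy as np
from scipy.ndimage import convolve1d, uniform_filter1d

def gabor_pair_1d(f0, sigma):
    # Quadrature pair: even (cosine) and odd (sine) phase Gabor kernels
    half = int(3 * sigma)
    t = np.arange(-half, half + 1)
    g = np.exp(-t**2 / (2 * sigma**2))
    return g * np.cos(2 * np.pi * f0 * t), g * np.sin(2 * np.pi * f0 * t)

def temporal_gabor_energy(video, f0=0.25, sigma=2.0, tma=5):
    # Phase-insensitive energy along the time axis of a (T, H, W) volume:
    # sum of squared even/odd responses, then Temporal Moving Average (TMA)
    g_even, g_odd = gabor_pair_1d(f0, sigma)
    e = convolve1d(video, g_even, axis=0)**2 \
      + convolve1d(video, g_odd, axis=0)**2
    return uniform_filter1d(e, size=tma, axis=0)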
Visual Saliency in Movie Videos - Demo
[Demo panels: original RGB frames, luminance STDE, color contrast, low-pass energy]
COGNIMUSE Database: Lord of the Rings: The Return of the King
Eye Fixation Prediction
[P. Koutras and P. Maragos, “A Perceptually-based Spatio-Temporal Computational Framework for Visual Saliency Estimation”, Sig. Proc.: Im. Com., 2015.]
Visual Saliency Demo
Original RGB Frames with Eye Tracking
Luminance Spatio-Temporal Dominant Energy
Eye-Tracking Movie Database - Examples
[P. Koutras, A. Katsamanis and P. Maragos, “Predicting Eyes' Fixations in Movie Videos: Visual Saliency Experiments on a New Eye-Tracking Database”, HCI 2014.]
Center Bias – ETMD Fixations
Evaluation Measures for Visual Saliency
Correlation Coefficient (CC): build a fixation map by centering a 2D Gaussian at each viewer's eye fixation, then correlate it with the saliency map.
Normalized Scanpath Saliency (NSS): standardize the saliency map (zero mean, unit variance), take its values at each viewer's fixation positions, and average over all viewers' fixations.
Area Under Curve (AUC): area under the Receiver Operating Characteristic (ROC) curve (False Positive Rate vs. Recall) for the binary classification problem (salient / non-salient regions). Minimal sketches of CC and NSS follow.
[A. Borji, D.N. Sihite and L. Itti, “Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study”, IEEE Trans. Image Process., 2013.]
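Minimal Python sketches of the NSS and CC measures, assuming fixations are given as (row, col) pixel coordinates; the Gaussian width sigma is an illustrative choice.

import numpy as np
from scipy.ndimage import gaussian_filter

def nss(sal_map, fixations):
    # Standardize to zero mean / unit variance, then average the values
    # at the viewers' fixation positions
    s = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-8)
    return np.mean([s[r, c] for r, c in fixations])

def cc(sal_map, fixations, sigma=20):
    # Fixation density map: a 2D Gaussian centered at each fixation
    fix_map = np.zeros_like(sal_map, dtype=float)
    for r, c in fixations:
        fix_map[r, c] += 1.0
    fix_map = gaussian_filter(fix_map, sigma)
    return np.corrcoef(sal_map.ravel(), fix_map.ravel())[0, 1]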
Fixation Prediction Results – CRCNS ORIG
Comparison with 15 state-of-the-art models, including 3 spatio-temporal models representing the 3 basic approaches: cognitively inspired, statistical framework, and frequency-domain analysis.
Fixation Prediction Results – ETMD
6 Oscar-winning Hollywood movies: Chicago, Crash, Departed, Finding Nemo, Gladiator, Lord of the Rings.
2 short video clips (3-3.5 min) from each movie; scenes with both high action and dialogue.
Eye-tracking human annotation: eye-tracking data from 10 different people, on both grayscale and color versions of each video; one fixation point per frame.
Framewise Saliency
[P. Koutras, A. Zlatintsi, E. Iosif, A. Katsamanis, P. Maragos and A. Potamianos, “Predicting Audio-Visual Salient Events Based on Visual, Audio and Text Modalities for Movie Summarization”, ICIP 2015.]
Framewise Visual Saliency - Features
Visual features: 3D Gabor energy model on both the luminance and color streams:
Spatio-temporal dominant energies (filterbank of 400 3D Gabor filters).
Spatial dominant energies (filterbank of 40 spatial Gabor filters).
Framewise Visual Saliency – Energy Curves (3D Gabor Energy Model)
Energy curves are a simple 3D-to-1D mapping: the mean value of each 2D frame slice of each 3D energy volume (STDE, SDE), yielding 4 temporal sequences of visual feature values; a minimal sketch follows.
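A minimal sketch of this 3D-to-1D mapping, assuming a (T, H, W) energy volume.

import numpy as np

def energy_curve(energy_volume):
    # One value per frame: the mean of each (H, W) slice of a (T, H, W)
    # energy volume (STDE or SDE, luminance or color stream)
    return energy_volume.mean(axis=(1, 2))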
Framewise Visual Saliency - Summarization
Feature postprocessing: standardize features (zero mean, unit variance); compute 1st and 2nd order derivatives (deltas).
Classification based on KNN or GMM: binary classification problem (salient / non-salient video segments); confidence scores; median filtering of the saliency measurement; sorting the frames by saliency measurement (see the sketch below).
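A sketch of the KNN branch with confidence scores and temporal median filtering, assuming scikit-learn is available and per-frame feature vectors as input; k and the median window are illustrative parameters.

import numpy as np
from scipy.signal import medfilt
from sklearn.neighbors import KNeighborsClassifier   # assumed available

def framewise_saliency_ranking(X_train, y_train, X_test, k=5, win=51):
    # Binary salient / non-salient classifier over per-frame feature vectors
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    conf = knn.predict_proba(X_test)[:, 1]           # confidence scores
    conf = medfilt(conf, kernel_size=win)            # temporal median filtering
    return np.argsort(-conf), conf                   # frames sorted by saliency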
Results on Hollywood Movies - ROC Curves
AUC
TMM’13 (KNN): 0.603
ICIP’15 (KNN): 0.699
ICIP’15 GMM (M=10): 0.660
ICIP’15 GMM (M=10, Viterbi): 0.668
[G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Rapantzikos, G. Skoumas and Y. Avrithis, “Multimodal Saliency and Fusion for Movie Summarization based on Aural, Visual, and Textual Attention”, IEEE Trans. Multimedia, 2013.]
[P. Koutras, A. Zlatintsi, E. Iosif, A. Katsamanis, P. Maragos and A. Potamianos, “Predicting Audio-Visual Salient Events Based on Visual, Audio and Text Modalities for Movie Summarization”, ICIP 2015.]
Part 2: Conclusions
Importance of spatio-temporal video processing: saliency estimation, event detection, summarization.
Visual saliency: deep networks achieve state-of-the-art performance on standard benchmarks, but datasets remain small-scale, center-biased, and biased towards semantic objects.
Spatio-temporal saliency networks: eye fixation prediction, framewise saliency.
Tutorial slides: http://cognimuse.cs.ntua.gr/icassp17