ICCS-NTUA Contributions to E-teams of MUSCLE WP6 and WP10

Prof. Petros Maragos

National Technical University of Athens

School of Electrical and Computer Engineering

URL: http://cvsp.cs.ntua.gr/projects/muscle

WP6 E-teams: 8-12-2005, MUSCLE, ICCS-NTUA

ICCS-NTUA: E-team Researchers & Directions

Researchers:

P. Maragos, S. Kollias (Faculty members)

G. Papandreou, K. Rapantzikos, G. Evangelopoulos, A. Katsamanis, I. Kokkinos (PhD GRA)

G. Stamou, I. Avrithis (Post-Doc)

(WP6) E-team 1: Audio-Visual (AV) Speech Analysis & Recognition

Face Detection, Modeling & Tracking

AV Feature Extraction, Fusion, Dynamic Models for AV-ASR

AV to Articulatory Speech Inversion

(WP6) E-team 2: Audio-Visual Understanding

Audio-Visual Salient Event Detection,

Integrated Multimedia Content Analysis


AV-ASR Front-End

[Block diagram of the audio-visual ASR front-end. Recoverable blocks:]

• Speech signals s_i(t): M-Array Processing, VAD, Speaker Normalization

• Modulations – Energy: Multiband Filtering, Nonlinear Processing, Demodulation

• Dynamics – Fractals: Embedding, Geometrical Filtering, Fractal Dimensions

• MFCC

• Visual: Active Appearance Model, Face Detection/Tracking, Mouth R.O.I. Features

• Speech Feature Transform./Selection, Fusion, Feature Stream


Audiovisual ASR: Face Modeling

● A well-studied problem in Computer Vision:

● Active Appearance Models, Morphable Models, Active Blobs

● Both Shape & Appearance can enhance lipreading

● The shape and appearance of human faces “live” in low-dimensional manifolds

[Figure: image equations of the form mean + p1 · (mode 1) + p2 · (mode 2) = synthesized face, one for shape and one for appearance.]
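The low-dimensional linear model on this slide can be sketched with PCA. This is a minimal illustration under stated assumptions, not the authors' AAM implementation: the toy "shapes", the function names, and the choice of two modes are all hypothetical, and a real AAM would also align shapes and model appearance (texture) in a shape-normalized frame.

```python
import numpy as np

def fit_linear_model(samples, n_modes):
    """Fit a mean vector and the top n_modes PCA modes to row-vector samples."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    # Rows of vt are orthonormal directions of decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_modes]

def project(x, mean, modes):
    """Coefficients p such that x is approximated by mean + p @ modes."""
    return (x - mean) @ modes.T

def synthesize(p, mean, modes):
    """Generate an instance: mean + p1 * mode1 + p2 * mode2 + ..."""
    return mean + p @ modes

# Toy data (hypothetical): 50 "shapes" of 10 landmarks (20 coordinates)
# that vary along only two hidden directions.
rng = np.random.default_rng(0)
true_modes = rng.normal(size=(2, 20))
coeffs = rng.normal(size=(50, 2))
base = rng.normal(size=20)
shapes = base + coeffs @ true_modes

mean, modes = fit_linear_model(shapes, n_modes=2)
p = project(shapes[0], mean, modes)
reconstruction = synthesize(p, mean, modes)
```

Because the toy shapes really do lie on a two-dimensional subspace, two coefficients reconstruct each sample; for real faces the number of retained modes is a trade-off between compactness and reconstruction error.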


Image Fitting Example

[Figure: fitting iterations shown at steps 2, 6, 10, 14 and 18.]


Example: Face Interpretation Using AAM

[Video panels: original video; shape track superimposed on original video; reconstructed face.]

The reconstructed face is what the visual-only speech recognizer “sees”!

Generative models like AAM allow us to evaluate the output of the visual front-end.


Joint Image Segmentation and Object Detection via the Expectation-Maximization Algorithm

• Generative models ‘compete’ for image observations

• Segmentation translates into the assignment of each image observation to one of K models (image labelling)

• Segmentation labels are treated as hidden data

• EM algorithm:

• E-step: use current parameter estimates to assign micro-segments to objects

• M-step: use assignment probabilities to derive optimal model parameters

• Active Appearance Models are used as generative models for the object categories of cars and faces
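The E-step/M-step alternation above can be sketched on toy data. As a loudly stated assumption, the K competing generative models here are plain 1-D Gaussians over intensity values standing in for the Active Appearance Models of the slide, and the observations are scalar samples rather than micro-segments; all names are hypothetical.

```python
import numpy as np

def e_step(x, means, variances, priors):
    """Posterior probability that each observation belongs to each model."""
    # Log of (prior * Gaussian likelihood), shape (n_obs, K).
    log_p = (np.log(priors)
             - 0.5 * np.log(2 * np.pi * variances)
             - 0.5 * (x[:, None] - means) ** 2 / variances)
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    resp = np.exp(log_p)
    return resp / resp.sum(axis=1, keepdims=True)

def m_step(x, resp):
    """Re-estimate each model's parameters from the soft assignments."""
    weight = resp.sum(axis=0)
    means = (resp * x[:, None]).sum(axis=0) / weight
    variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / weight + 1e-6
    priors = weight / len(x)
    return means, variances, priors

def run_em(x, init_means, n_iter=50):
    means = np.asarray(init_means, dtype=float)
    variances = np.full(len(means), x.var())
    priors = np.full(len(means), 1.0 / len(means))
    for _ in range(n_iter):
        resp = e_step(x, means, variances, priors)       # assign observations
        means, variances, priors = m_step(x, resp)       # refit the models
    return means, variances, priors, resp

# Toy observations: 200 "background" and 100 "object" intensities.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.2, 0.05, 200), rng.normal(0.8, 0.05, 100)])
means, variances, priors, resp = run_em(x, init_means=[0.0, 1.0])
labels = resp.argmax(axis=1)  # image labelling: which model wins each observation
```

The segmentation labels are never observed directly; they appear only as the rows of `resp`, which is what treating them as hidden data amounts to in this sketch.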


Top-Down Segmentation Results

• Thresholding the E-step assignment probabilities gives a hard figure-ground segmentation

• No separate ‘shape-prior’ knowledge is necessary for the segmentation: the generative model already contains information about shape variation

• Combination of bottom-up & top-down detection

• At false-alarm locations the object model manages to reconstruct the image appearance only by chance, and therefore typically obtains a small image support for the object.
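A minimal sketch of the thresholding and image-support idea follows. The threshold of 0.5, the minimum support of 0.3, the 4×4 posterior maps, and the function names are all hypothetical values chosen for illustration; the slides do not state how support is measured or which cut-offs were used.

```python
import numpy as np

def figure_ground_mask(object_posterior, threshold=0.5):
    """Hard segmentation: True where the object model wins the observation."""
    return object_posterior > threshold

def accept_detection(object_posterior, threshold=0.5, min_support=0.3):
    """Accept a candidate only if the object claims enough of its window."""
    mask = figure_ground_mask(object_posterior, threshold)
    support = mask.mean()  # fraction of window pixels assigned to the object
    return bool(support >= min_support), mask

# Hypothetical E-step posteriors inside two candidate windows.
true_hit = np.full((4, 4), 0.1)
true_hit[:3, :3] = 0.9          # object explains most of the window
false_alarm = np.full((4, 4), 0.1)
false_alarm[0, 0] = 0.9         # object explains one pixel by chance

accepted_true, mask_true = accept_detection(true_hit)
accepted_false, mask_false = accept_detection(false_alarm)
```

Here the bottom-up stage is assumed to have proposed both windows; the top-down support test is what separates the genuine detection from the false alarm.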

Spatio-Temporal Visual Attention I: Video Analysis

• Create video volume

• Feature extraction from spatiotemporal data

• Fusion & saliency generation
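The three stages above can be sketched as follows. This is only an illustration under assumptions: the two features (local intensity contrast and frame-to-frame change), the 3×3×3 neighbourhood, and the averaging fusion rule are hypothetical stand-ins, since the slides do not specify which features or fusion scheme were used.

```python
import numpy as np

def local_mean(volume, radius=1):
    """Mean over a (2r+1)^3 spatiotemporal neighbourhood, edge-padded."""
    padded = np.pad(volume, radius, mode="edge")
    size = 2 * radius + 1
    t, h, w = volume.shape
    total = np.zeros_like(volume, dtype=float)
    for dt in range(size):
        for dy in range(size):
            for dx in range(size):
                total += padded[dt:dt + t, dy:dy + h, dx:dx + w]
    return total / size ** 3

def normalize(feature):
    """Rescale a feature volume to [0, 1]; a constant volume maps to zeros."""
    span = feature.max() - feature.min()
    if span == 0:
        return np.zeros_like(feature)
    return (feature - feature.min()) / span

def saliency_volume(volume):
    contrast = np.abs(volume - local_mean(volume))                     # centre vs. surround
    motion = np.abs(np.diff(volume, axis=0, prepend=volume[:1]))       # temporal change
    return (normalize(contrast) + normalize(motion)) / 2.0             # fusion by averaging

# Step 1: create the video volume (T, H, W) from a toy sequence in which
# a bright 3x3 square moves one pixel to the right per frame.
frames = []
for t in range(8):
    frame = np.zeros((16, 16))
    frame[6:9, t + 2:t + 5] = 1.0
    frames.append(frame)
volume = np.stack(frames)

# Steps 2 and 3: feature extraction, then fusion into a saliency volume.
saliency = saliency_volume(volume)
row, col = np.unravel_index(saliency[4].argmax(), saliency[4].shape)
```

In frame 4 the most salient pixel falls on the moving square, while static background far from it receives zero saliency.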


Spatio-Temporal Visual Attention II: Classification & Segmentation

• Use spatiotemporal VA for efficient global classification of videos

Claim: features extracted only from low or high saliency regions are more representative of the input video

• Foreground/Background segmentation

Claim: most salient regions are related to foreground areas of the video
