Lecture 11: Video Structure, Representation, and Image Retrieval Systems
Dr Jing Chen, NICTA & CSE UNSW
COMP9519 Multimedia Systems, S2 2008
COMP9519 Multimedia Systems – Lecture 11 – Slide 2 – J Chen
Last week…
- Image Sampling: purpose of sampling; aliasing; smoothing before sampling
- Linear Image Filters & Convolution: kernel functions; convolution; box (average) filter; Gaussian kernel; smoothing with a Gaussian kernel
- Edge Detection: what causes an edge; using the gradient; smoothing before computing the gradient
COMP9519 Multimedia Systems – Lecture 11 – Slide 3 – J Chen
Outline
- Video Structure and Representation: frame, shot, scene, story; shot segmentation; scene segmentation
- Image Retrieval Systems: query types; video indexing; multimedia retrieval applications
- Future research
COMP9519 Multimedia Systems – Lecture 11 – Slide 4 – J Chen
Video Structure
Video is composed of “words” (structural elements) and a “language” (composition rules). Four structural levels:
- Frame: the set of all picture elements that represent one complete image
- Shot: a series of interrelated consecutive frames taken continuously by a single camera, representing a continuous action in time and space
- Scene: a sequence of shots with a common object of interest, a common location, or a common thematic concept; individual shots are joined into a scene through transitions
- Story: a sequence of scenes with a coherent focus which contains at least two independent declarative clauses. The distinction between scene and story is somewhat blurred; the two terms are used interchangeably in the remainder of this lecture
COMP9519 Multimedia Systems – Lecture 11 – Slide 5 – J Chen
A diagram of video structure
[Figure: video data is segmented into shots (shot #1, shot #2, …, shot #21); shots are grouped into scenes/stories (scene #1, …, scene #8); each shot is represented by a keyframe.]
* H.B. Kang, Video Abstraction Techniques for a Digital Library
COMP9519 Multimedia Systems – Lecture 11 – Slide 6 – J Chen
Another possible structure
[Figure: a story contains scenes (scene 1, scene 2); scenes contain shots (shot 1 … shot 5); shots contain frames 1, 2, …, N.]
COMP9519 Multimedia Systems – Lecture 11 – Slide 7 – J Chen
Why develop a video structure?
Some applications: indexing and browsing; non-linear editing; event detection.
[Figure: typical video retrieval system — video data → segmentation → key frame computation → feature extraction (color, motion, shape, …) → video query, retrieval and production / video browsing.]
* Yan Liu & Fei Li
COMP9519 Multimedia Systems – Lecture 11 – Slide 8 – J Chen
Outline
- Structure of video: frame, shot, scene, story; why we need video structure analysis
- Shot segmentation
- Scene segmentation
COMP9519 Multimedia Systems – Lecture 11 – Slide 9 – J Chen
Shot segmentation
- The shot is the most fundamental semantic element in video
- A physical unit: shot boundaries are determined by editing points
- Segmentation accuracy is affected by the variety of transition effects
- Intensively studied, with some mature algorithms: roughly 90-95% accuracy for cuts and about 80% for gradual transitions
COMP9519 Multimedia Systems – Lecture 11 – Slide 10 – J Chen
Transition effects between shots
Transitions move from one camera angle to another, or from one camera to another. Major types:
- Cut: an abrupt transition between two shots
- Fade: the overall brightness of the shot increases or decreases into a frame of a single color. A fade-out/fade-in is a slow decrease/increase in brightness; for example, a fade-out to black may indicate the end of a sequence
- Dissolve: one image is superimposed on the other as the frames of the first shot get dimmer and those of the second get brighter. Dissolves are frequently used to indicate a passage of time
- Wipe: one shot wipes across the frame and replaces the previous shot. A wipe can move in any direction: opening from one side to the other, starting in the center and moving out, or starting at the edge of the frame and moving in
A sketch of how fades and dissolves are formed follows below.
* http://www.siggraph.org/education/materials/HyperGraph/animation/cameras/traditional_film_camera_techniqu.htm
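A fade or dissolve can be modeled as a frame-wise linear blend I(t) = (1 − α(t))·A + α(t)·B between an outgoing frame A and an incoming frame B. A minimal numpy sketch of this model; the frame size, duration, and function names are illustrative, not from the original slides:

```python
import numpy as np

def dissolve(frame_a, frame_b, n):
    """Generate n >= 2 transition frames blending frame_a into frame_b.

    A fade-out is the special case where frame_b is a constant frame
    (e.g. all black); a fade-in is the reverse.
    """
    frames = []
    for t in range(n):
        alpha = t / (n - 1)  # ramps 0 -> 1 across the transition
        blend = ((1 - alpha) * frame_a.astype(np.float32)
                 + alpha * frame_b.astype(np.float32))
        frames.append(blend.astype(np.uint8))
    return frames

# Toy usage: dissolve a mid-gray luma frame into black over 10 frames.
a = np.full((144, 176), 128, dtype=np.uint8)  # QCIF-sized frame
b = np.zeros((144, 176), dtype=np.uint8)
transition = dissolve(a, b, 10)
```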
COMP9519 Multimedia Systems – Lecture 11 – Slide 11 – J Chen
Examples -- Cut
Demo 20041030_15k_cropped(1315, 4736)
COMP9519 Multimedia Systems – Lecture 11 – Slide 12 – J Chen
Examples -- Fade out & Fade in
* I. Koprinska and S. Carrato, “Temporal video segmentation: A survey”, Signal Processing: Image communication, Vol. 16, pp. 477-500, 2001.
COMP9519 Multimedia Systems – Lecture 11 – Slide 13 – J Chen
Examples -- Dissolve and cut
* I. Koprinska and S. Carrato, “Temporal video segmentation: A survey”, Signal Processing: Image communication, Vol. 16, pp. 477-500, 2001.
Also see 20041030_15k_cropped_50.yuv
COMP9519 Multimedia Systems – Lecture 11 – Slide 14 – J Chen
Demo 20041030_15k_cropped_3540.yuv
Examples -- Wipe
COMP9519 Multimedia Systems – Lecture 11 – Slide 15 – J Chen
Principles of shot boundary detection
Shot boundary detection is an instance of the time series segmentation problem. Two approaches:
- Detect changes in the time series
- Or, cluster/annotate samples in the time series
[Figure: a time series of frame feature values plotted against t.]
COMP9519 Multimedia Systems – Lecture 11 – Slide 16 – J Chen
Challenges of shot boundary detection
Desired: accurate detection of transitions and transition types (frame-accurate location).
Challenges:
- Fast motion of objects or camera
- Fast illumination changes
- Fire, smoke, flags in the wind, sea waves, …
- Specularities, shadows, reflections from glass, water, …
- Instantaneous illumination changes due to flash photography
- Very short shots (down to single-frame “shots”)
- Very long gradual transitions
- Text overlay, graphics, animation
- Screen splits, video in video
- Video artifacts: MPEG errors, compression noise, camera noise, …
* Arnon Amir
COMP9519 Multimedia Systems – Lecture 11 – Slide 17 – J Chen
Shot/scene/story segmentation
[Table: Representation → Features → Detection]
- Representation (video, audio): pixel values; sample values; compressed data
- Features: histograms; edges; texture; motion; DCT coefficients; audio types
- Detection: thresholding; statistical; model driven
* S.F.Chang
COMP9519 Multimedia Systems – Lecture 11 – Slide 18 – J Chen
Shot boundary detection by discontinuity detection
- Use features to represent video frames
- Derive a continuity signal for the video from those features
- Find discontinuities in the sequence (== shot boundaries), usually through thresholding
COMP9519 Multimedia Systems – Lecture 11 – Slide 19 – J Chen
Shot boundary detection by frame differencing
Compute the mean absolute change of intensity I(x,y) between frames k and k+l over all frame pixels:
d(k, k+l) = (1 / (X·Y)) · Σ_{x,y} | I_k(x,y) − I_{k+l}(x,y) |
T. Kikukawa and S. Kawafuchi, “Development of an automatic summary editing system for the audio visual resources,” Trans. Inst. Electron., Inform., Commun. Eng., vol. J75-A, no. 2, pp. 204–212, 1992.
A variation: only counting the pixels that change considerably from one frame to another (above a threshold T1)
K. Otsuji, Y. Tonomura, and Y. Ohba, “Video browsing using brightness data,” in Proc. SPIE/IS&T VCIP’91, vol. 1606, 1991, pp. 980–989.
In both approaches the decision is made by comparing against a threshold. Problems with these two approaches:
- Sensitive to discontinuities due to camera or object motion, flashes, etc.
- Even shifting the whole image by one pixel can cause trouble! (See the sketch below.)
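A minimal sketch of the two measures above, assuming frames are 8-bit grayscale numpy arrays; thresholds and function names are illustrative:

```python
import numpy as np

def mean_abs_diff(frame_k, frame_kl):
    """Kikukawa & Kawafuchi: mean absolute intensity change over all pixels."""
    return float(np.mean(np.abs(frame_k.astype(np.int16)
                                - frame_kl.astype(np.int16))))

def changed_pixel_ratio(frame_k, frame_kl, t1=20):
    """Otsuji et al. variation: fraction of pixels changing by more than t1."""
    diff = np.abs(frame_k.astype(np.int16) - frame_kl.astype(np.int16))
    return float(np.mean(diff > t1))

def detect_cuts(frames, threshold=30.0):
    """Declare a cut wherever the inter-frame difference exceeds the threshold."""
    return [i for i in range(1, len(frames))
            if mean_abs_diff(frames[i - 1], frames[i]) > threshold]
```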
COMP9519 Multimedia Systems – Lecture 11 – Slide 20 – J Chen
Frame difference example
[Figure: two plots of average frame difference against time t.]
COMP9519 Multimedia Systems – Lecture 11 – Slide 21 – J Chen
A closer look
Frame t, frame t+1, and their difference image (average pixel value difference: 3.2)
Frame t, frame t spatially shifted by 1 pixel, and their difference image (average pixel value difference: 19.0)
COMP9519 Multimedia Systems – Lecture 11 – Slide 22 – J Chen
Shot boundary detection by motion compensated difference
- Divide frame k into non-overlapping blocks
- For every block in frame k, find the best motion-compensated block in frame k+l and compute the difference between these two blocks
- The final difference between frames k and k+l is a weighted sum of the normalized differences (in the range 0 to 1) between all corresponding motion-compensated blocks
- The shot boundary decision is again based on thresholding
- Problems: requires computationally complex motion estimation; the measure must still distinguish cuts from gradual transitions
A sketch follows below.
B. Shahraray, “Scene change detection and content-based sampling of video sequences,” in Proc. IS&T/SPIE, vol. 2419, Feb. 1995, pp. 2–13.
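A minimal sketch of the block-matching flavour of this idea, assuming grayscale numpy frames. The exhaustive search range, the block size, and the equal block weights are illustrative simplifications of the weighted sum described above:

```python
import numpy as np

def best_block_diff(block, frame_next, y, x, search=8):
    """Minimum mean-abs-difference between `block` (at (y, x) in frame k)
    and candidate blocks within +/-`search` pixels in frame k+l,
    normalized to [0, 1]."""
    h, w = block.shape
    H, W = frame_next.shape
    best = np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy <= H - h and 0 <= xx <= W - w:
                cand = frame_next[yy:yy + h, xx:xx + w]
                d = np.mean(np.abs(block.astype(np.int16)
                                   - cand.astype(np.int16)))
                best = min(best, d)
    return best / 255.0

def mc_frame_diff(frame_k, frame_kl, block=16):
    """Equal-weight average of normalized motion-compensated block
    differences; high values suggest a shot boundary."""
    H, W = frame_k.shape
    diffs = [best_block_diff(frame_k[y:y + block, x:x + block],
                             frame_kl, y, x)
             for y in range(0, H - block + 1, block)
             for x in range(0, W - block + 1, block)]
    return float(np.mean(diffs))
```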
COMP9519 Multimedia Systems – Lecture 11 – Slide 23 – J Chen
Shot boundary detection by histogram comparison
Advantages:
- Histograms are less sensitive to camera and object motion
- Histograms are invariant to image rotation and change slowly under variations of viewing angle and scale
Disadvantages:
- Histograms discard spatial information; two images with similar histograms may have completely different content
- A solution to this problem is to use joint histograms over several features: color, edge density, texture, etc.
Greg Pass, Ramin Zabih: Comparing Images Using Joint Histograms. Multimedia Syst. 7(3): 234-240 (1999).
COMP9519 Multimedia Systems – Lecture 11 – Slide 24 – J Chen
Global histogram comparison
- A cut is declared if the absolute sum of histogram differences between two successive frames exceeds a threshold
- Various color spaces and histogram similarity metrics have been tried by different authors
- The twin-comparison algorithm uses the difference between consecutive frames to detect cuts, and the accumulated difference over a sequence of frames to detect gradual transitions (see the sketch below)
H.J. Zhang, A. Kankanhalli, S.W. Smoliar, Automatic partitioning of full-motion video, Multimedia Systems 1(1) (1993) 10-28.
* Irena Koprinska, Sergio Carrato
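A minimal sketch of global-histogram cut detection together with the twin-comparison idea, assuming grayscale numpy frames; the two thresholds t_high and t_low are illustrative:

```python
import numpy as np

def hist_diff(f1, f2, bins=64):
    """Absolute sum of (grayscale) histogram differences, per pixel."""
    h1, _ = np.histogram(f1, bins=bins, range=(0, 256))
    h2, _ = np.histogram(f2, bins=bins, range=(0, 256))
    return float(np.sum(np.abs(h1 - h2))) / f1.size

def twin_comparison(frames, t_high=0.5, t_low=0.1):
    """Cuts: a single difference above t_high.  Gradual transitions: a run
    of differences above t_low whose accumulated sum exceeds t_high."""
    cuts, graduals = [], []
    acc, start = 0.0, None
    for i in range(1, len(frames)):
        d = hist_diff(frames[i - 1], frames[i])
        if d > t_high:
            cuts.append(i)
            acc, start = 0.0, None
        elif d > t_low:  # possible start/continuation of a gradual transition
            if start is None:
                start = i
            acc += d
        else:
            if start is not None and acc > t_high:
                graduals.append((start, i - 1))
            acc, start = 0.0, None
    return cuts, graduals
```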
COMP9519 Multimedia Systems – Lecture 11 – Slide 25 – J Chen
Local histogram comparison
Combines histogram-based and block-based comparisons to take spatial information into account.
A. Nagasaka and Y. Tanaka, “Automatic video indexing and full-video search for object appearances,” in Visual Database Systems II, E. Knuth and L. M.Wegner, Eds. Amsterdam, The Netherlands: North-Holland, 1992, pp. 113–127.
- Frames k and k+l are each divided into 16 blocks, and histograms are computed for all blocks
- The χ²-test is used to compare corresponding block histograms
- When computing the discontinuity as a sum of block-histogram differences, the eight largest differences are discarded to reduce the influence of motion and noise
- Drawback: this can make the measure less sensitive to some shot boundaries, since some boundaries involve only partial color histogram changes in the frame
A sketch follows below.
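A minimal sketch of this scheme (a 4×4 grid of blocks, per-block χ² comparison, eight largest block differences discarded); grayscale numpy frames are assumed and the bin count is illustrative:

```python
import numpy as np

def chi2(h1, h2):
    """Chi-square distance between two histograms."""
    denom = h1 + h2
    mask = denom > 0
    return float(np.sum((h1[mask] - h2[mask]) ** 2 / denom[mask]))

def local_hist_diff(f1, f2, grid=4, bins=64, discard=8):
    """Sum of per-block chi-square differences over a grid x grid split,
    discarding the `discard` largest block differences to suppress the
    influence of object motion and noise."""
    H, W = f1.shape
    bh, bw = H // grid, W // grid
    diffs = []
    for by in range(grid):
        for bx in range(grid):
            b1 = f1[by * bh:(by + 1) * bh, bx * bw:(bx + 1) * bw]
            b2 = f2[by * bh:(by + 1) * bh, bx * bw:(bx + 1) * bw]
            h1, _ = np.histogram(b1, bins=bins, range=(0, 256))
            h2, _ = np.histogram(b2, bins=bins, range=(0, 256))
            diffs.append(chi2(h1.astype(np.float64), h2.astype(np.float64)))
    return float(sum(sorted(diffs)[:len(diffs) - discard]))
```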
COMP9519 Multimedia Systems – Lecture 11 – Slide 26 – J Chen
Overview of some shot segmentation methods
Frame differencing
Motion compensated frame difference
Global histogram
Joint histograms
Local histogram
COMP9519 Multimedia Systems – Lecture 11 – Slide 27 – J Chen
Thresholding vs. clustering-based shot segmentation
Thresholding:
- Local decision (based on information from very few frames)
- Thresholds are typically highly sensitive to the type of input video
Clustering:
- Views shot segmentation as a k-class unsupervised clustering problem; frames are assigned to one of the k classes via k-means
- Global decision
- Eliminates the need for threshold setting, and also allows multiple features to be used simultaneously to improve performance
COMP9519 Multimedia Systems – Lecture 11 – Slide 28 – J Chen
Clustering-based shot segmentation
- Compute the color-content dissimilarity of each consecutive frame pair using a histogram-based comparison metric: the χ² test, or the histogram difference (L1 distance)
- Based on these metric values, classify frames into two classes (shot boundaries and non-boundaries) using the k-means algorithm; label all frames in the cluster with the largest mean as boundaries (see the sketch below)
B. Günsel, A.M. Ferman, A.M. Tekalp, "Temporal video segmentation using unsupervised clustering and semantic object tracking," Journal of Electronic Imaging, 7(3), (1998) 592-604.
A.M. Ferman, A.M. Tekalp, "Efficient filtering and clustering for temporal video segmentation and visual summarization," Journal of Visual Communication and Image Representation 9(4) (1998) 336-351.
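A minimal sketch of the two-class k-means step, using the histogram L1 distance as the dissimilarity metric and a hand-rolled 1-D k-means to stay dependency-free; all names are illustrative:

```python
import numpy as np

def hist_diff(f1, f2, bins=64):
    """Histogram L1 distance between two grayscale frames, per pixel."""
    h1, _ = np.histogram(f1, bins=bins, range=(0, 256))
    h2, _ = np.histogram(f2, bins=bins, range=(0, 256))
    return float(np.sum(np.abs(h1 - h2))) / f1.size

def kmeans_1d(values, k=2, iters=50):
    """Tiny 1-D k-means: returns labels and cluster means."""
    v = np.asarray(values, dtype=np.float64)
    centers = np.linspace(v.min(), v.max(), k)
    for _ in range(iters):
        labels = np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = v[labels == j].mean()
    return labels, centers

def cluster_boundaries(frames):
    """Label frames in the higher-mean cluster of consecutive-frame
    dissimilarities as shot boundaries; no threshold is needed."""
    d = [hist_diff(frames[i - 1], frames[i]) for i in range(1, len(frames))]
    labels, centers = kmeans_1d(d, k=2)
    boundary = int(np.argmax(centers))
    return [i + 1 for i, lab in enumerate(labels) if lab == boundary]
```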
COMP9519 Multimedia Systems – Lecture 11 – Slide 29 – J Chen
Summary of video shot segmentation methods
- Pixel domain approaches: frame differencing; motion-compensated frame differencing; histograms (global, joint, and local); model driven
- Threshold-based vs. clustering-based decisions
COMP9519 Multimedia Systems – Lecture 11 – Slide 30 – J Chen
Scene segmentation
- Scene: a series of consecutive shots with a common focus
- Scene/story segmentation is usually built on shot segmentation: shot segmentation + shot clustering/grouping
COMP9519 Multimedia Systems – Lecture 11 – Slide 31 – J Chen
Representing video shots using representative frames (R-frames)
The need to represent video with fewer frames: a 90-minute movie at 24 frames per second has 90 × 60 × 24 = 129,600 frames!
R-frames:
- R-frames are selected frames within a video shot
- R-frame R_A (== original frame 1) represents frames 1 to i, where the distance between each of these i frames and R_A is <= a threshold ε
- R_B is the first frame following R_A whose distance from R_A is > ε, and so on
[Diagram: frames 1, 2, …, i, i+1, …, j, j+1, …, N are covered by R-frames R_A, R_B, R_C, …]
A sketch follows below.
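A minimal sketch of sequential R-frame selection under a threshold ε, using a histogram L1 distance as the inter-frame distance D; the choice of D and the value of ε are illustrative:

```python
import numpy as np

def frame_dist(f1, f2, bins=64):
    """Inter-frame distance D; here a per-pixel histogram L1 distance."""
    h1, _ = np.histogram(f1, bins=bins, range=(0, 256))
    h2, _ = np.histogram(f2, bins=bins, range=(0, 256))
    return float(np.sum(np.abs(h1 - h2))) / f1.size

def extract_r_frames(shot_frames, eps=0.2):
    """Greedy selection: the current R-frame represents every following
    frame within distance eps of it; the first frame whose distance
    exceeds eps becomes the next R-frame (R_B, R_C, ...)."""
    r_indices = [0]  # R_A is the first frame of the shot
    for i in range(1, len(shot_frames)):
        if frame_dist(shot_frames[r_indices[-1]], shot_frames[i]) > eps:
            r_indices.append(i)
    return r_indices
```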
COMP9519 Multimedia Systems – Lecture 11 – Slide 32 – J Chen
Distance between two shots
Given two shots S_i and S_j, each represented by a set of R-frames, the distance between the two shots is
d(S_i, S_j) ≈ min_{l,m} D(f_l, f_m)
where f_l and f_m enumerate all R-frames of shots S_i and S_j, respectively.
[Diagram: R-frames for shot i: R_Ai, R_Bi, R_Ci, …; R-frames for shot j: R_Aj, R_Bj, R_Cj, …]
COMP9519 Multimedia Systems – Lecture 11 – Slide 33 – J Chen
Visual dissimilarity between two clusters
Clustering groups shots into clusters. The visual dissimilarity between two clusters is defined via the maximum shot similarity (i.e., the minimum shot distance) over all possible shot pairs:
d(C_i, C_j) = min_{S_k ∈ C_i, S_l ∈ C_j} d(S_k, S_l)
where C_i is the i-th cluster. This follows the same principle as the shot distance definition.
COMP9519 Multimedia Systems – Lecture 11 – Slide 34 – J Chen
Criteria for time-constrained clustering of video shots
- Temporal: within each cluster, no two frames may be more than T frames apart
- Visual similarity (measures intra-cluster quality): the visual dissimilarity between all shot pairs in a cluster should be <= a threshold ε
- Inter-cluster quality: the dissimilarity between clusters should be > ε
- Follows the same principle used when extracting R-frames; a sketch follows below
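A minimal sketch of greedy, time-constrained grouping of temporally ordered shots under the criteria above; the shot representation, the threshold values, and the exact form of the temporal test are assumptions, not from the original slides:

```python
import numpy as np

def shot_dist(feats_i, feats_j):
    """d(S_i, S_j) ~ minimum distance over all R-frame feature pairs
    (i.e., maximum similarity), matching the shot distance definition."""
    return min(float(np.sum(np.abs(fi - fj)))
               for fi in feats_i for fj in feats_j)

def time_constrained_grouping(shots, eps=0.2, T=1000):
    """Each shot joins an existing cluster only if (a) no two frames in the
    resulting cluster would be more than T frames apart and (b) it is
    visually within eps of every member; otherwise it starts a new
    cluster.  `shots` is a temporally ordered list of
    (r_frame_feature_vectors, first_frame, last_frame) tuples."""
    clusters = []  # each cluster is a list of shot indices
    for idx, (feats, _first, last) in enumerate(shots):
        placed = False
        for cluster in clusters:
            ok_time = all(last - shots[m][1] <= T for m in cluster)
            ok_vis = all(shot_dist(feats, shots[m][0]) <= eps
                         for m in cluster)
            if ok_time and ok_vis:
                cluster.append(idx)
                placed = True
                break
        if not placed:
            clusters.append([idx])
    return clusters
```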
COMP9519 Multimedia Systems – Lecture 11 – Slide 35 – J Chen
From video to scene transition graph (Yeung, Yeo & Liu ’98)
COMP9519 Multimedia Systems – Lecture 11 – Slide 36 – J Chen
Scene segmentation with scene transition graph (Yeung, Yeo & Liu ’98)
A scene transition graph is a compact representation of a video as a directed graph G = (V, E):
- V: vertices; each vertex represents a cluster of similar shots
- E: edges; vertices i and j are connected by an edge if some shot in cluster i precedes a shot in cluster j
Scene segmentation amounts to finding cut edges in the graph: a cut edge is an edge whose removal leaves two disconnected subgraphs. A sketch follows below.
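A minimal sketch of building a scene transition graph from per-shot cluster labels and finding cut edges by removal-and-reachability; names and the toy labels are illustrative:

```python
from collections import defaultdict

def build_stg(shot_cluster_labels):
    """Directed edge i -> j whenever a shot in cluster i immediately
    precedes a shot in cluster j."""
    edges = set()
    for a, b in zip(shot_cluster_labels, shot_cluster_labels[1:]):
        if a != b:
            edges.add((a, b))
    return set(shot_cluster_labels), edges

def is_connected(vertices, edges):
    """Connectivity of the undirected version of the graph, via DFS."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v] - seen)
    return seen == vertices

def cut_edges(vertices, edges):
    """Edges whose removal disconnects the graph; each marks a scene boundary."""
    return [e for e in edges if not is_connected(vertices, edges - {e})]

# Toy usage: shots labelled by cluster; the scene boundary is the cut
# edge between clusters 2 and 3.
labels = [1, 2, 1, 2, 3, 4, 3, 4]
V, E = build_stg(labels)
print(cut_edges(V, E))  # [(2, 3)]
```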
COMP9519 Multimedia Systems – Lecture 11 – Slide 37 – J Chen
GraphG=(V,E)
V: vertices. E: edges.
COMP9519 Multimedia Systems – Lecture 11 – Slide 38 – J Chen
A diagram of scene segmentation with scene transition graph (YYL ’98)
Note temporal constraint
COMP9519 Multimedia Systems – Lecture 11 – Slide 39 – J Chen
Scene Transition Graph
COMP9519 Multimedia Systems – Lecture 11 – Slide 40 – J Chen
Video Browser -- Key-frame Based Hierarchical Video Browser
* H.J. Zhang et al., Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution, ACM Multimedia 1995
COMP9519 Multimedia Systems – Lecture 11 – Slide 41 – J Chen
Video Browser -- Browser Layout for Key-Frame Based Hierarchical Video Browser
COMP9519 Multimedia Systems – Lecture 11 – Slide 42 – J Chen
Outline
- Structure of video: frame, shot, scene, story; why we need video structure analysis
- Shot segmentation; scene segmentation; video browsers
- Introduction to Query and Image Retrieval: multimedia retrieval applications; future research
COMP9519 Multimedia Systems – Lecture 11 – Slide 43 – J Chen
Feature-based Similarity Search
Video Query
COMP9519 Multimedia Systems – Lecture 11 – Slide 44 – J Chen
Query types
- Point query: specifies a point Q in the data space and retrieves all point objects P in the database with identical coordinates: {P | P = Q}
- Range query: given a query point Q, a distance r, and a distance function M, retrieve all points P in the database whose distance from Q under M is at most r: {P | M(P, Q) <= r}
- Nearest-neighbor query: given a query point Q, retrieve the point P in the database that minimizes M(P, Q)
- k-nearest-neighbor query: given a query point Q, return the k points with the smallest distances M(P, Q)
COMP9519 Multimedia Systems – Lecture 11 – Slide 45 – J Chen
Distance functions (a sketch follows below):
- Euclidean (L2): d(x, y) = ( Σ_i (x_i − y_i)² )^(1/2)
- Manhattan (L1): d(x, y) = Σ_i |x_i − y_i|
- Maximum (L∞): d(x, y) = max_i |x_i − y_i|
- Weighted Euclidean: d(x, y) = ( Σ_i w_i (x_i − y_i)² )^(1/2)
- Weighted maximum: d(x, y) = max_i w_i |x_i − y_i|
- Ellipsoid: d(x, y) = ( (x − y)^T W (x − y) )^(1/2), where W is a positive definite similarity matrix
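A minimal numpy sketch of these distance functions plus a brute-force k-nearest-neighbor query; function names and the toy data are illustrative:

```python
import numpy as np

def l2(x, y):
    return float(np.sqrt(np.sum((x - y) ** 2)))      # Euclidean

def l1(x, y):
    return float(np.sum(np.abs(x - y)))              # Manhattan

def linf(x, y):
    return float(np.max(np.abs(x - y)))              # Maximum

def weighted_l2(x, y, w):
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))  # Weighted Euclidean

def ellipsoid(x, y, W):
    d = x - y                                        # W: positive definite
    return float(np.sqrt(d @ W @ d))

def knn(query, database, k=5, dist=l2):
    """Brute-force k-nearest-neighbor query: indices of the k database
    vectors closest to `query` under `dist`."""
    dists = np.array([dist(query, p) for p in database])
    return np.argsort(dists)[:k]

# Toy usage with random 8-D feature vectors.
rng = np.random.default_rng(0)
db = rng.random((100, 8))
q = rng.random(8)
print(knn(q, db, k=3))
```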
COMP9519 Multimedia Systems – Lecture 11 – Slide 46 – J Chen
Diagram of Image Retrieval System
[Figure: the user's query specification passes through feature extraction to produce feature vectors; images in the image database likewise pass through feature extraction into stored feature vectors; similarity comparison together with indexing & retrieval produces the retrieval results (output), which the user can refine through relevance feedback.]
COMP9519 Multimedia Systems – Lecture 11 – Slide 47 – J Chen
Image Retrieval: Client/Server mode
[Figure: the client sends a query to a search engine backed by an MPEG-7 database; the query response is a list of matching content (clip1.mp4, clip2.mp4, …); the client then requests content from a media server via RTSP, and the media is streamed back via RTP/RTCP.]
COMP9519 Multimedia Systems – Lecture 11 – Slide 48 – J Chen
Application: IBM’s QBIC
QBIC – Query by Image Content
- The first commercial CBIR system; a model system that influenced many others
- Uses color, texture, and shape features; text-based search can also be combined
- Uses R*-trees for indexing
*Deepak Bote
COMP9519 Multimedia Systems – Lecture 11 – Slide 49 – J Chen
QBIC – Search by color
** Images courtesy : Yong Rao
COMP9519 Multimedia Systems – Lecture 11 – Slide 50 – J Chen
QBIC – Search by shape
** Images courtesy : Yong Rao
COMP9519 Multimedia Systems – Lecture 11 – Slide 51 – J Chen
QBIC – Query by sketch
** Images courtesy : Yong Rao
COMP9519 Multimedia Systems – Lecture 11 – Slide 52 – J Chen
Architecture of QBIC Query System
COMP9519 Multimedia Systems – Lecture 11 – Slide 53 – J Chen
QBIC fast search and indexing
Pre-filtering:
- A computationally fast filter is applied to all data, lower-bounding the histogram color distance by computing a distance based on average colors
- Only items that pass the filter are processed by the second stage, which computes the true similarity metric (see the sketch below)
Indexing: PCA and R*-trees
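A minimal sketch of the two-stage idea: a cheap average-color distance filters the database, and the expensive histogram distance is computed only on the survivors. The `bound` factor stands in for the lower-bounding constant relating the two distances (its correct value depends on the actual metrics, so treat it as an assumption):

```python
import numpy as np

def avg_color(image):
    """Mean RGB of an H x W x 3 image: the cheap 3-D filter feature."""
    return image.reshape(-1, 3).mean(axis=0)

def hist_distance(h1, h2):
    """The expensive, 'true' distance (plain L2 here for simplicity)."""
    return float(np.linalg.norm(h1 - h2))

def two_stage_query(q_avg, q_hist, db_avgs, db_hists, r, bound=1.0):
    """Stage 1: keep items whose average-color distance could still fall
    within range r, assuming the cheap distance (scaled by `bound`)
    lower-bounds the true one, so rejected items are safe to skip.
    Stage 2: exact distance on the survivors only."""
    survivors = [i for i, a in enumerate(db_avgs)
                 if np.linalg.norm(q_avg - a) <= bound * r]
    return [i for i in survivors
            if hist_distance(q_hist, db_hists[i]) <= r]
```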
COMP9519 Multimedia Systems – Lecture 11 – Slide 54 – J Chen
QBIC demo: http://www.hermitagemuseum.org/fcgi-bin/db2www/qbicSearch.mac/qbic?selLang=English
COMP9519 Multimedia Systems – Lecture 11 – Slide 55 – J Chen
Some applications of content-based image and video retrieval
- WWW search
- Digital libraries
- Stock photography for remote printing: search for stock images and clip art, then print them out
- The textile industry: store and retrieve textile designs digitally
- Corporate videos, training and educational videos
- At home
COMP9519 Multimedia Systems – Lecture 11 – Slide 56 – J Chen
Future directions – Multimedia
R. Jain (Georgia Institute of Technology): “Multimedia research is not keeping pace with the technology. In research, multimedia has to become multimedia.”
“Email will become more and more audio and video. Text mail will remain important, but audio and video mail will become popular and someday overtake text in its utility. Storage, retrieval and presentation of true multimedia will emerge. Current thinking about audio and video being an appendix to text will not take us too far.”
“Multimedia research must address how the semantics of the situation emerges out of individual data sources independent of the medium represented by the source. This issue of semantics is critical to all the applications of multimedia.”
COMP9519 Multimedia Systems – Lecture 11 – Slide 57 – J Chen
Future directions – Multimedia
D. Gibbon (AT&T Research Labs):
“(…) We must now focus on less structured media.
Media processing algorithms such as speech recognition and scene change detection have been proven to work well enough on clean, structured data, but are known to have poorer performance in the presence of noise or lack of structure.
Professionally produced media amounts to only the tip of the iceberg: if we include video teleconferencing, voice communications, video surveillance, etc., then the bulk of media has very little structure that is useful for indexing in a database system. This latter class of media represents a potentially rich application space.”
COMP9519 Multimedia Systems – Lecture 11 – Slide 58 – J Chen
Future directions – Multimedia
T.S. Huang (University of Illinois at Urbana-Champaign):
“Current real-world useful video databases, where the retrieval modes are flexible, and based on a combination of keywords and visual and audio contents, are nonexistent. We hope that such flexible systems will in the future find applications in: sport events, broadcasting news, documentaries, education and training, home videos, and above all biomedicine.”
COMP9519 Multimedia Systems – Lecture 11 – Slide 59 – J Chen
Outline
- Structure of video: frame, shot, scene, story; why we need video structure analysis
- Shot segmentation; scene segmentation; video browsers
- Introduction to query and indexing; multimedia retrieval applications; future research
COMP9519 Multimedia Systems – Lecture 11 – Slide 60 – J Chen
References
- Chapter 4 of the book Multimedia Information Retrieval and Management.
- J. Boreczky, A. Girgensohn, G. Golovchinsky, S. Uchihashi, "An Interactive Comic Book Presentation for Exploring Video," CHI 2000 Conference Proceedings, ACM Press, pp. 185-192, 2000.
- M. Christel, M. Smith, C.R. Taylor, D. Winkler, "Evolving Video Skims into Useful Multimedia Abstractions," Proc. ACM CHI '98 (Los Angeles, CA, April 1998), ACM Press, pp. 171-178.
- M. Christel, D. Winkler, C.R. Taylor, "Improving Access to a Digital Video Library," Human-Computer Interaction: INTERACT '97, Chapman & Hall, London, 1997, pp. 524-531.
COMP9519 Multimedia Systems – Lecture 11 – Slide 61 – J Chen
References
- C. Böhm, S. Berchtold, D. Keim, "Searching in High-Dimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases," ACM Computing Surveys 33(3), 2001.
- Slides by R.F. Moeller, Deepak Bote, Ahmet Senturk, and Oge Marques.
- A. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching," Proc. SIGMOD Conference 1984, pp. 47-57.
- G. Lu, "Techniques and Data Structures for Efficient Multimedia Retrieval Based on Similarity," IEEE Transactions on Multimedia, 4(3), September 2002.
- A.H.H. Ngu, Q.Z. Sheng, D.Q. Huynh, R. Lei, "Combining Multi-Visual Features for Efficient Indexing in a Large Image Database," The VLDB Journal, 9(4):280-293, May 2001.
- Jing Chen, project report, NICTA.
- C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, W. Equitz, "Efficient and Effective Querying by Image Content," Journal of Intelligent Information Systems, 3(3-4):231-262, July 1994.
- M. Flickner et al., "Query by Image and Video Content: The QBIC System," IEEE Computer, 28(9):23-32, 1995.