Lecture 11: Video Structure, Representation, and Image Retrieval System

Dr Jing Chen
NICTA & CSE UNSW

CS9519 Multimedia Systems
S2 2008

[email protected]

COMP9519 Multimedia Systems – Lecture 11 – Slide 2 – J Chen

Last week…

Image Sampling
  • Purpose of sampling
  • Aliasing
  • Smoothing before sampling

Linear Image Filter & Convolution
  • Kernel function, Convolution
  • Box (Average) Filter, Gaussian Kernel
  • Smoothing with a Gaussian Kernel

Edge Detection
  • What causes an edge
  • Using Gradient
  • Smoothing before getting gradient (?)

COMP9519 Multimedia Systems – Lecture 11 – Slide 3 – J Chen

Outline

Video Structure and Representation
  • Frame, shot, scene, story
  • Shot segmentation
  • Scene segmentation

Image Retrieval System
  • Query Types
  • Video indexing
  • Multimedia retrieval applications

Future research

COMP9519 Multimedia Systems – Lecture 11 – Slide 4 – J Chen

Video Structure

Video is composed of “words” (structural elements) and a “language” (composition rules)

Four structural levels

Frame
  • The set of all picture elements that represent one complete image

Shot
  • A series of interrelated consecutive frames taken continuously by a single camera and representing a continuous action in time and space

Scene
  • A sequence of shots with a common object of interest, a common location or a common thematic concept
  • Individual shots are joined together through transitions to make a particular scene

Story
  • A sequence of scenes with a coherent focus which contains at least two independent declarative clauses
  • The distinction between scene and story is somewhat blurred
  • Scene and story will be used interchangeably in the remainder of this lecture

COMP9519 Multimedia Systems – Lecture 11 – Slide 5 – J Chen

A diagram of video structure

[Figure: video data divided into shots (shot #1 … shot #21), each shot represented by a keyframe, with shots grouped into scenes (stories) (scene #1 … scene #8)]

* H.B.Kang, Video Abstraction Techniques For A Digital Library

COMP9519 Multimedia Systems – Lecture 11 – Slide 6 – J Chen

Another possible structure

[Figure: hierarchy with four levels: story; scene1, scene2; shot1 … shot5; frames 1, 2, …, N]

COMP9519 Multimedia Systems – Lecture 11 – Slide 7 – J Chen

Why develop a video structure?

Some applications:
  • Indexing and browsing
  • Non-linear editing
  • Event detection

Typical video retrieval system diagram

[Figure: blocks labelled Video data; Segmentation; Key frame computation; Feature extraction (Color, Motion, Shape, …); Video query, retrieval and production; Video browsing]

* Yan Liu & Fei Li

COMP9519 Multimedia Systems – Lecture 11 – Slide 8 – J Chen

Outline

Structure of video
  • Frame, shot, scene, story
  • Why we need to do video structure analysis

Shot segmentation

Scene segmentation

COMP9519 Multimedia Systems – Lecture 11 – Slide 9 – J Chen

Shot segmentation

Shot is the most fundamental semantic element in videos
  • A physical unit
  • Shot boundaries are determined by editing points

Segmentation accuracy is affected by the variety of transition effects

Intensively studied, with some mature algorithms
  • ~90-95% accuracy for cuts
  • ~80% for gradual transitions

COMP9519 Multimedia Systems – Lecture 11 – Slide 10 – J Chen

Transition effects between shots

Transitions: from one camera angle to another camera angle, or from one camera to another camera

Major types:

Cut
  • Abrupt transition between two shots

Fade
  • The overall value of the shot increases or decreases into a frame of just one color
  • Fade out and fade in: a slow decrease/increase in brightness
  • For example, a fade out to black may indicate the end of the sequence

Dissolve
  • Shows one image superimposed on the other as the frames of the first shot get dimmer and those of the second one get brighter
  • Dissolves are used frequently to indicate a passage of time

Wipe
  • One shot wipes across the frame and replaces the previous shot
  • Wipes can move in any direction: they can open from one side to the other, start in the center and move out, or start at the edge of the frame and move in

* http://www.siggraph.org/education/materials/HyperGraph/animation/cameras/traditional_film_camera_techniqu.htm

COMP9519 Multimedia Systems – Lecture 11 – Slide 11 – J Chen

Examples -- Cut

Demo 20041030_15k_cropped(1315, 4736)

COMP9519 Multimedia Systems – Lecture 11 – Slide 12 – J Chen

Examples -- Fade out & Fade in

* I. Koprinska and S. Carrato, “Temporal video segmentation: A survey”, Signal Processing: Image communication, Vol. 16, pp. 477-500, 2001.

COMP9519 Multimedia Systems – Lecture 11 – Slide 13 – J Chen

Examples -- Dissolve and cut

* I. Koprinska and S. Carrato, “Temporal video segmentation: A survey”, Signal Processing: Image communication, Vol. 16, pp. 477-500, 2001.

Also see 20041030_15k_cropped_50.yuv

COMP9519 Multimedia Systems – Lecture 11 – Slide 14 – J Chen

Examples -- Wipe

Demo 20041030_15k_cropped_3540.yuv

COMP9519 Multimedia Systems – Lecture 11 – Slide 15 – J Chen

Principles of shot boundary detection

An instance of the time series segmentation problem

Two approaches
  • To detect changes in the time series
  • Or, to cluster/annotate samples in the time series

[Figure: time series along the time axis t]

COMP9519 Multimedia Systems – Lecture 11 – Slide 16 – J Chen

Challenges of shot boundary detection

Desired: accurate detection of transitions and transition types (frame-accurate location)

Challenges:
  • Fast motion of object or camera
  • Fast illumination changes
  • Fire, smoke, flags in the wind, sea waves, …
  • Specularities, shadows, reflections from glass, water, …
  • Instantaneous illumination changes due to flash photography
  • Very short shots (down to single-frame “shots”)
  • Very long gradual transitions
  • Text overlay, graphics, animation
  • Screen split, video in video
  • Video artifacts: MPEG errors, compression noise, camera noise, …

* Arnon Amir

COMP9519 Multimedia Systems – Lecture 11 – Slide 17 – J Chen

Shot/scene/story segmentation

Video / Audio

Representation
  • Pixel values
  • Sample values
  • Compressed data

Features
  • Histograms
  • Edge
  • Texture
  • Motion
  • DCT
  • Audio types

Detection
  • Thresholding
  • Statistical
  • Model driven

* S.F.Chang

COMP9519 Multimedia Systems – Lecture 11 – Slide 18 – J Chen

Shot boundary detection by discontinuity detection

Use features to represent video frames

Extract continuities in video from the features

Find discontinuities in the sequence == shot boundaries
  • Usually through thresholding

COMP9519 Multimedia Systems – Lecture 11 – Slide 19 – J Chen

Shot boundary detection by frame differencing

Compute the mean absolute change of intensity I(x,y) between frames k and k+l for all frame pixels.

T. Kikukawa and S. Kawafuchi, “Development of an automatic summary editing system for the audio visual resources,” Trans. Inst. Electron., Inform., Commun. Eng., vol. J75-A, no. 2, pp. 204–212, 1992.

A variation: only counting the pixels that change considerably from one frame to another (above a threshold T1)

K. Otsuji, Y. Tonomura, and Y. Ohba, “Video browsing using brightness data,” in Proc. SPIE/IS&T VCIP’91, vol. 1606, 1991, pp. 980–989.

In both approaches the decision is made by comparing with a threshold

Problem with the above two approaches
  • Sensitive to discontinuities due to camera or object motion, flash, etc.
  • Even shifting the whole image by one pixel may cause trouble!
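The two frame-differencing measures on this slide can be sketched as follows. This is a minimal illustration only, assuming grayscale frames stored as nested lists of intensities; the function names, the toy frames and the threshold values are hypothetical and not taken from the cited papers.

```python
from typing import List

Frame = List[List[int]]  # grayscale image: rows of intensities in 0..255


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute intensity change over all pixels of two frames."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def changed_fraction(a: Frame, b: Frame, t1: int) -> float:
    """Variation: fraction of pixels whose intensity changes by more than t1."""
    changed = sum(abs(pa - pb) > t1 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return changed / (len(a) * len(a[0]))


def detect_cuts(frames: List[Frame], threshold: float) -> List[int]:
    """Indices k where a cut is declared between frames k and k+1."""
    return [k for k in range(len(frames) - 1)
            if mean_abs_diff(frames[k], frames[k + 1]) > threshold]


# Toy sequence: two dark frames followed by two bright frames.
dark = [[10, 12], [11, 10]]
dark2 = [[11, 12], [10, 10]]
bright = [[200, 210], [205, 200]]
frames = [dark, dark2, bright, bright]
cuts = detect_cuts(frames, threshold=50.0)  # a cut between frames 1 and 2
```

Both measures compare pixels at identical positions, which is why a small camera shift can push the difference above the threshold even though the content has not changed.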

COMP9519 Multimedia Systems – Lecture 11 – Slide 20 – J Chen

Frame difference example

[Figure: two plots of average frame difference against time t]

COMP9519 Multimedia Systems – Lecture 11 – Slide 21 – J Chen

A closer look

Frame t vs. frame t+1: difference image (average pixel value difference: 3.2)

Frame t vs. frame t spatially shifted by 1: difference image (average pixel value difference: 19.0)

COMP9519 Multimedia Systems – Lecture 11 – Slide 22 – J Chen

Shot boundary detection by motion compensated difference

Divide frame k into non-overlapping blocks

For every block in frame k, find the best motion compensated block in frame k+l and compute the difference between these two blocks

The final difference between frames k and k+l is a weighted sum of the normalized differences (in the range of 0 to 1) between all corresponding motion compensated blocks

The shot boundary detection decision is again based on thresholding

Problem
  • Requires computationally complex motion estimation
  • Needs to distinguish from gradual transitions

B. Shahraray, “Scene change detection and content-based sampling of video sequences,” in Proc. IS&T/SPIE, vol. 2419, Feb. 1995, pp. 2–13.
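A minimal sketch of the block-matching idea above, under several assumptions: grayscale frames as nested lists, a full search over a small window, the sum of absolute differences as the block measure, and equal weights for all blocks. None of these choices is claimed to match Shahraray's method; the names `block_sad` and `mc_difference` are hypothetical.

```python
from typing import List

Frame = List[List[int]]  # grayscale image: rows of intensities in 0..255


def block_sad(a: Frame, b: Frame, ay: int, ax: int, by: int, bx: int, n: int) -> int:
    """Sum of absolute differences between an n x n block of a and one of b."""
    return sum(abs(a[ay + i][ax + j] - b[by + i][bx + j])
               for i in range(n) for j in range(n))


def mc_difference(a: Frame, b: Frame, n: int = 2, search: int = 1) -> float:
    """Equal-weight average of the best-match block differences, each in [0, 1]."""
    h, w = len(a), len(a[0])
    scores = []
    for y in range(0, h - n + 1, n):          # non-overlapping blocks of frame a
        for x in range(0, w - n + 1, n):
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy <= h - n and 0 <= xx <= w - n:
                        sad = block_sad(a, b, y, x, yy, xx, n)
                        best = sad if best is None else min(best, sad)
            scores.append(best / (255 * n * n))  # normalize to [0, 1]
    return sum(scores) / len(scores)


# A bright square that moves one pixel to the right between two frames.
a = [[255, 255, 0, 0],
     [255, 255, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
b = [[0, 255, 255, 0],
     [0, 255, 255, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
```

On this toy pair the plain normalized frame difference is 0.25, while the motion compensated difference is 0.0625, because the moved square is matched at its new position.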

COMP9519 Multimedia Systems – Lecture 11 – Slide 23 – J Chen

Shot boundary detection by histogram comparison

Advantages:
  • Histograms are less sensitive to camera and object motion
  • Histograms are invariant to image rotation and change slowly under variations of viewing angle and scale

Disadvantages:
  • Histograms discard spatial information
  • Two images with similar histograms may have completely different content

A solution to this problem is to use joint histograms: color, edge density, texture, etc.

Greg Pass, Ramin Zabih: Comparing Images Using Joint Histograms. Multimedia Syst. 7(3): 234-240 (1999).

COMP9519 Multimedia Systems – Lecture 11 – Slide 24 – J Chen

Global histogram comparison
  • A cut is declared if the sum of absolute histogram differences between two successive frames is greater than a threshold
  • Various color spaces and various histogram similarity metrics have been tried by different authors
  • The twin-comparison algorithm uses the difference between consecutive frames to detect a cut, and the accumulated difference over a sequence of frames to detect gradual transitions

H.J. Zhang, A. Kankanhalli, S.W. Smoliar, Automatic partitioning of full-motion video, Multimedia Systems 1(1) (1993) 10-28.

* Irena Koprinska, Sergio Carrato
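The global histogram difference and a simplified twin-comparison pass can be sketched as below. This is an illustration under stated assumptions, not the algorithm as published by Zhang et al.: the accumulated difference is approximated by summing consecutive-frame differences, and the names `t_cut` and `t_low` and all numeric values are hypothetical.

```python
from typing import List, Tuple


def histogram(frame: List[List[int]], bins: int = 4) -> List[int]:
    """Gray-level histogram with equal-width bins over 0..255."""
    h = [0] * bins
    for row in frame:
        for p in row:
            h[p * bins // 256] += 1
    return h


def hist_diff(h1: List[int], h2: List[int]) -> int:
    """Sum of absolute bin-wise differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(h1, h2))


def twin_comparison(diffs: List[float], t_cut: float,
                    t_low: float) -> Tuple[List[int], List[Tuple[int, int]]]:
    """diffs[k] is the histogram difference between frames k and k+1.

    Returns (cuts, gradual): cut positions, and (start, end) index ranges in
    diffs whose accumulated difference reaches t_cut.
    """
    cuts: List[int] = []
    gradual: List[Tuple[int, int]] = []
    start, acc = None, 0.0
    for k, d in enumerate(diffs):
        if d >= t_cut:                 # large single jump: a cut
            cuts.append(k)
            start, acc = None, 0.0
        elif d >= t_low:               # moderate change: possible gradual transition
            if start is None:
                start, acc = k, 0.0
            acc += d
        else:                          # change has stopped: decide on the candidate
            if start is not None and acc >= t_cut:
                gradual.append((start, k - 1))
            start, acc = None, 0.0
    if start is not None and acc >= t_cut:
        gradual.append((start, len(diffs) - 1))
    return cuts, gradual


diffs = [1, 2, 40, 1, 8, 9, 10, 9, 1, 2]
cuts, gradual = twin_comparison(diffs, t_cut=30, t_low=5)
```

With these toy values the single jump of 40 is reported as a cut, and the run of moderate differences (8, 9, 10, 9) is reported as one gradual transition because it accumulates past the higher threshold.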

COMP9519 Multimedia Systems – Lecture 11 – Slide 25 – J Chen

Local histogram comparison

Combining histogram based and block based comparisons to take spatial information into account

A. Nagasaka and Y. Tanaka, “Automatic video indexing and full-video search for object appearances,” in Visual Database Systems II, E. Knuth and L. M.Wegner, Eds. Amsterdam, The Netherlands: North-Holland, 1992, pp. 113–127.

  • Both images k and k+l are divided into 16 blocks
  • Histograms are computed for all the blocks
  • The χ²-test is used to compare corresponding block histograms
  • When computing the discontinuity as a sum of region-histogram differences, the eight largest differences are discarded to reduce the influence of motion and noise

Less sensitive to shot boundary changes
  • Some boundaries only involve partial color histogram changes in the video
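The block-wise comparison can be sketched as follows. This is a hedged illustration: the symmetric χ² form used here is one common variant and is an assumption, as are the gray-level histograms, the function names and the toy 4 x 4 frames (one pixel per block).

```python
from typing import List

Frame = List[List[int]]  # grayscale image: rows of intensities in 0..255


def block_histograms(frame: Frame, grid: int = 4, bins: int = 4) -> List[List[int]]:
    """Split the frame into grid x grid blocks; return one histogram per block."""
    h, w = len(frame), len(frame[0])
    bh, bw = h // grid, w // grid
    hists = []
    for by in range(grid):
        for bx in range(grid):
            hist = [0] * bins
            for y in range(by * bh, (by + 1) * bh):
                for x in range(bx * bw, (bx + 1) * bw):
                    hist[frame[y][x] * bins // 256] += 1
            hists.append(hist)
    return hists


def chi_square(h1: List[int], h2: List[int]) -> float:
    """Symmetric chi-square distance between two histograms (assumed form)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def local_hist_diff(f1: Frame, f2: Frame, grid: int = 4, bins: int = 4,
                    discard: int = 8) -> float:
    """Sum of block distances after discarding the `discard` largest ones."""
    d = sorted(chi_square(a, b) for a, b in
               zip(block_histograms(f1, grid, bins), block_histograms(f2, grid, bins)))
    return sum(d[:len(d) - discard])


black = [[0] * 4 for _ in range(4)]
white = [[255] * 4 for _ in range(4)]
# Only 5 of the 16 blocks change: this is treated as local motion or noise.
partial = [[255, 255, 255, 255], [255, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```

Discarding the largest block differences makes the measure robust to local motion, at the cost noted on the slide: a true boundary that only changes part of the image can be missed.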

COMP9519 Multimedia Systems – Lecture 11 – Slide 26 – J Chen

Overview of some shot segmentation methods

Frame differencing

Motion compensated frame difference

Global histogram

Joint histograms

Local histogram

COMP9519 Multimedia Systems – Lecture 11 – Slide 27 – J Chen

Thresholding vs Clustering based shot segmentation

Thresholding
  • Local decision (based on the info of very few frames)
  • Thresholds are typically highly sensitive to the type of input video

Clustering
  • View shot segmentation as a k-class unsupervised clustering problem
  • Assign frames to one of the k classes via k-means
  • Global decision
  • Not only eliminates the need for threshold setting but also allows multiple features to be used simultaneously to improve the performance

COMP9519 Multimedia Systems – Lecture 11 – Slide 28 – J Chen

Clustering based shot segmentation

Compute the color-content dissimilarity of each frame pair using a histogram-based frame comparison metric
  • χ² test
  • Or histogram difference (L1 distance)

Based on these metric values, classify frames into two classes (shot boundaries and non-boundaries) using the K-means algorithm
  • Label all frames in the cluster with the largest mean as boundaries

B. Günsel, A. M. Ferman, A. M. Tekalp, Temporal video segmentation using unsupervised clustering and semantic object tracking, Journal of Electronic Imaging, 7(3) (1998) 592-604.

A. M. Ferman, A. M. Tekalp, Efficient filtering and clustering for temporal video segmentation and visual summarization, Journal of Visual Communication and Image Representation 9(4) (1998) 336-351.
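A minimal sketch of the two-class step, assuming the dissimilarity of each consecutive frame pair has already been computed as a single number. It runs a plain one-dimensional k-means with k = 2; the centroid initialization at the minimum and maximum value is an assumption made here for determinism, and the function name and data are hypothetical.

```python
from typing import List


def two_means_boundaries(d: List[float], iters: int = 50) -> List[int]:
    """Cluster 1-D dissimilarities into two classes with k-means (k = 2).

    Returns the indices that fall in the cluster with the larger mean,
    i.e. the frame pairs labelled as shot boundaries.
    """
    lo, hi = min(d), max(d)  # initial centroids (assumed initialization)
    for _ in range(iters):
        low = [x for x in d if abs(x - lo) <= abs(x - hi)]
        high = [x for x in d if abs(x - lo) > abs(x - hi)]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):  # assignments are stable
            break
        lo, hi = new_lo, new_hi
    return [k for k, x in enumerate(d) if abs(x - lo) > abs(x - hi)]


# d[k] is the dissimilarity between frames k and k+1 (toy values).
d = [0.02, 0.03, 0.01, 0.85, 0.04, 0.02, 0.90, 0.03]
boundaries = two_means_boundaries(d)
```

No threshold is set by hand: the split between the two classes comes from the data of the whole sequence, which is the "global decision" contrasted with thresholding on the previous slide.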

COMP9519 Multimedia Systems – Lecture 11 – Slide 29 – J Chen

Summary of video shot segmentation methods

Pixel domain approaches
  • Frame differencing
  • Motion compensated frame differencing
  • Histograms (global, joint and local)
  • Model driven

Threshold vs. Clustering

COMP9519 Multimedia Systems – Lecture 11 – Slide 30 – J Chen

Scene segmentation

Scene: a series of consecutive shots with a common focus

Scene/story segmentation is usually based on shot segmentation
  • Shot segmentation + shot clustering/grouping

COMP9519 Multimedia Systems – Lecture 11 – Slide 31 – J Chen

Representing video shots using representative frames (R-frames)

The need to represent video using fewer frames
  • A 90-minute movie at 24 frames per second has 90*60*24 = 129,600 frames!

R-frames
  • R-frames are selected frames in a video shot
  • R-frame RA (== original frame number 1) represents frames 1 to i, where the distance between each of these i frames and RA <= threshold ε
  • RB is the first frame following RA for which the distance between RA and RB > threshold ε

frames    1, 2, ………, i, i+1, ………, j, j+1, ………, N

R-frames  RA, RB, RC, . . .
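The selection rule above can be sketched as a single pass over the frames of a shot. This is a minimal sketch assuming each frame has already been reduced to a feature value and that a distance function between features is available; the scalar features, the absolute-difference distance and the value of `eps` are hypothetical. Indices are 0-based here, so index 0 corresponds to original frame number 1.

```python
from typing import Callable, List, Sequence, TypeVar

F = TypeVar("F")


def select_r_frames(frames: Sequence[F], dist: Callable[[F, F], float],
                    eps: float) -> List[int]:
    """Return the 0-based indices of the R-frames of one shot."""
    if not frames:
        return []
    r = [0]  # RA is always the first frame of the shot
    for k in range(1, len(frames)):
        # The next R-frame is the first frame farther than eps from the current one.
        if dist(frames[r[-1]], frames[k]) > eps:
            r.append(k)
    return r


# Toy shot: each frame summarised by one scalar feature (an assumption).
feats = [0.0, 0.1, 0.2, 0.6, 0.7, 1.3, 1.4]
r_frames = select_r_frames(feats, lambda a, b: abs(a - b), eps=0.5)
```

Here three R-frames (indices 0, 3 and 5) stand in for seven frames; a shot with little visual change would be represented by RA alone.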

COMP9519 Multimedia Systems – Lecture 11 – Slide 32 – J Chen

Distance between two shots

Given two shots S_i and S_j represented by two sets of R-frames, the distance between the two shots is

$$ d(S_i, S_j) = \min_{l,\,m} D(f_l, f_m) $$

where f_l and f_m enumerate through all R-frames in shots i and j, respectively

R-frames for shot i RAi RBi RCi . . .

R-frames for shot j RAj RBj RCj RCj. . .
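As a small sketch of this definition, the shot distance is the smallest frame distance over all pairs of R-frames, one taken from each shot. The function name, the integer features and the absolute-difference frame distance below are illustrative assumptions.

```python
from typing import Callable, Sequence, TypeVar

F = TypeVar("F")


def shot_distance(r_i: Sequence[F], r_j: Sequence[F],
                  frame_dist: Callable[[F, F], float]) -> float:
    """Minimum frame distance over all R-frame pairs (f_l from shot i, f_m from shot j)."""
    return min(frame_dist(fl, fm) for fl in r_i for fm in r_j)


# Toy R-frame features for two shots (hypothetical values).
shot_i = [0, 6]
shot_j = [13, 7]
d_ij = shot_distance(shot_i, shot_j, lambda a, b: abs(a - b))
```

Two shots are therefore considered close as soon as any one pair of their R-frames is close, even if the other R-frames differ.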

COMP9519 Multimedia Systems – Lecture 11 – Slide 33 – J Chen

Visual dissimilarity between two clusters

Clustering: to group shots into clusters

Visual dissimilarity between two clusters is defined as the maximum shot similarity between all possible shot similarity comparisons

where Ci is the ith cluster

Follows the same principle as in shot distance definition

COMP9519 Multimedia Systems – Lecture 11 – Slide 34 – J Chen

Criteria for time constrained clustering of video shots

Temporal
  • In each cluster, no two frames can be more than T frames apart

Visual similarity
  • The visual dissimilarity between all shot pairs in a cluster should be <= a threshold ε
      • Measures intra-cluster quality
  • The dissimilarity between all clusters should be > threshold ε
      • Measures inter-cluster quality
  • Follows the same principle as when extracting R-frames

COMP9519 Multimedia Systems – Lecture 11 – Slide 35 – J Chen

From video to scene transition graph (Yeung, Yeo & Liu ’98)

COMP9519 Multimedia Systems – Lecture 11 – Slide 36 – J Chen

Scene segmentation with scene transition graph (Yeung, Yeo & Liu ’98)

A scene transition graph is a compact representation of a video that is a directed graph G=(V,E)

  • V: vertices. Each vertex represents a cluster of similar shots
  • E: edges. Two vertices i, j are connected via an edge if there exists a shot in cluster i that precedes a shot in cluster j

Scene segmentation amounts to finding cut edges in the graph
  • A cut edge is an edge whose removal results in two disconnected graphs
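A minimal sketch of building a scene transition graph from a sequence of shot cluster labels and finding its cut edges. Two assumptions are made here: "precedes" is read as "immediately precedes", and connectivity is checked with edge directions ignored. The brute-force search (remove each edge, test connectivity) is chosen for clarity, not efficiency, and the labels are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def build_stg(labels: List[str]) -> Set[Edge]:
    """Directed edges (i, j): a shot in cluster i is immediately followed by one in j."""
    return {(a, b) for a, b in zip(labels, labels[1:]) if a != b}


def connected(nodes: Set[str], edges: Set[Edge]) -> bool:
    """True if the graph is connected when edge directions are ignored."""
    adj: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        for n in adj[queue.popleft()]:
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return seen == nodes


def cut_edges(labels: List[str]) -> List[Edge]:
    """Edges whose removal splits the graph into two disconnected parts."""
    nodes, edges = set(labels), build_stg(labels)
    return sorted(e for e in edges if not connected(nodes, edges - {e}))


# Cluster label of each shot in temporal order: a dialogue alternating between
# clusters A and B, followed by another alternating between C and D.
labels = ["A", "B", "A", "B", "C", "D", "C", "D"]
scene_boundaries = cut_edges(labels)
```

In this toy sequence the only cut edge is B → C, which separates the scene made of clusters {A, B} from the scene made of clusters {C, D}.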

COMP9519 Multimedia Systems – Lecture 11 – Slide 37 – J Chen

Graph

G=(V,E)

V: vertices. E: edges.

COMP9519 Multimedia Systems – Lecture 11 – Slide 38 – J Chen

A diagram of scene segmentation with scene transition graph (YYL ’98)

Note temporal constraint

COMP9519 Multimedia Systems – Lecture 11 – Slide 39 – J Chen

Scene Transition Graph

COMP9519 Multimedia Systems – Lecture 11 – Slide 40 – J Chen

Video Browser -- Key-frame Based Hierarchical Video Browser

* H.J. Zhang et al., Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution, ACM MM 1995

COMP9519 Multimedia Systems – Lecture 11 – Slide 41 – J Chen

Video Browser -- Browser Layout for Key-Frame Based Hierarchical Video Browser

COMP9519 Multimedia Systems – Lecture 11 – Slide 42 – J Chen

Outline

Structure of video
  • Frame, shot, scene, story
  • Why we need video structure analysis

Shot segmentation

Scene segmentation

Video Browser

Introduction to Query and Image Retrieval

Multimedia retrieval applications

Future research

COMP9519 Multimedia Systems – Lecture 11 – Slide 43 – J Chen

Feature-based Similarity Search

Video Query

COMP9519 Multimedia Systems – Lecture 11 – Slide 44 – J Chen

Query types

Point query
  • Specifies a point in the data space and retrieves all point objects in the database with identical coordinates

Range query
  • Given a query point Q, a distance r, and a distance function M, retrieve all points P from the database which have a distance smaller than or equal to r from Q according to M

Nearest neighbor query
  • Given a query point Q, retrieve the nearest neighbor point P from the database, i.e., find the object closest to Q

K-nearest neighbor query
  • Given a query point, return the k nearest neighbor points
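The four query types can be sketched with a linear scan over a small in-memory "database" of feature points. This is an illustration only: a real system would use an index structure instead of scanning every point, and the function names, the Euclidean metric and the sample points are assumptions.

```python
import math
from typing import Callable, List, Sequence

Point = Sequence[float]
Metric = Callable[[Point, Point], float]


def euclidean(p: Point, q: Point) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def point_query(db: List[Point], q: Point) -> List[Point]:
    """All points with coordinates identical to q."""
    return [p for p in db if tuple(p) == tuple(q)]


def range_query(db: List[Point], q: Point, r: float, m: Metric) -> List[Point]:
    """All points within distance r of q according to metric m."""
    return [p for p in db if m(p, q) <= r]


def nn_query(db: List[Point], q: Point, m: Metric) -> Point:
    """The single point closest to q."""
    return min(db, key=lambda p: m(p, q))


def knn_query(db: List[Point], q: Point, k: int, m: Metric) -> List[Point]:
    """The k points closest to q, nearest first."""
    return sorted(db, key=lambda p: m(p, q))[:k]


db = [(0, 0), (1, 1), (3, 4), (6, 8)]
```

For example, `range_query(db, (0, 0), 5, euclidean)` returns the three points within distance 5 of the origin, and `knn_query(db, (2, 2), 2, euclidean)` returns `(1, 1)` followed by `(3, 4)`.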

COMP9519 Multimedia Systems – Lecture 11 – Slide 45 – J Chen

Distance functions
  • Euclidean (L2)
  • Manhattan (L1)
  • Maximum (L∞)
  • Weighted Euclidean
  • Weighted maximum
  • Ellipsoid, where W is a positive definite similarity matrix
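The formulas for these distance functions did not survive in this transcript, so the sketch below uses the standard textbook definitions, which is an assumption about what the slide showed. Here `w` is a vector of non-negative per-dimension weights and `W` is the similarity matrix of the ellipsoid (quadratic form) distance.

```python
import math
from typing import Sequence

Vec = Sequence[float]


def euclidean(p: Vec, q: Vec) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p: Vec, q: Vec) -> float:
    return sum(abs(a - b) for a, b in zip(p, q))


def maximum(p: Vec, q: Vec) -> float:
    return max(abs(a - b) for a, b in zip(p, q))


def weighted_euclidean(p: Vec, q: Vec, w: Vec) -> float:
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, p, q)))


def weighted_maximum(p: Vec, q: Vec, w: Vec) -> float:
    return max(wi * abs(a - b) for wi, a, b in zip(w, p, q))


def ellipsoid(p: Vec, q: Vec, W: Sequence[Vec]) -> float:
    """sqrt((p - q)^T W (p - q)) for a positive definite matrix W."""
    d = [a - b for a, b in zip(p, q)]
    n = len(d)
    return math.sqrt(sum(d[i] * W[i][j] * d[j] for i in range(n) for j in range(n)))


p, q = (0, 0), (3, 4)
identity = [[1, 0], [0, 1]]
```

With the identity matrix the ellipsoid distance reduces to the Euclidean distance, and with a diagonal W it reduces to the weighted Euclidean distance; off-diagonal entries let similar feature dimensions (for example neighbouring color bins) contribute jointly.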

COMP9519 Multimedia Systems – Lecture 11 – Slide 46 – J Chen

Diagram of Image Retrieval System

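[Figure: block diagram with blocks labelled user; Query specification; Feature extraction (on the query and on the image data); Image data; Image database; Similarity comparison; Indexing & retrieval; Retrieval results; Relevance feedback; output. Feature vectors flow between the feature extraction, similarity comparison and image database blocks.]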

COMP9519 Multimedia Systems – Lecture 11 – Slide 47 – J Chen

Image Retrieval: Client/Server mode

[Figure: client/server diagram with an MPEG-7 Database, a Search Engine and a Media Server. Labels: Query target; Query Response (list of matching content: clip1.mp4, clip2.mp4, …); Request for Content (RTSP); Streaming Media (RTP/RTCP)]

COMP9519 Multimedia Systems – Lecture 11 – Slide 48 – J Chen

Application: IBM’s QBIC

QBIC – Query by Image Content

First commercial CBIR system.

Model system – influenced many others.

Uses color, texture, shape features

Text-based search can also be combined.

Uses R*-trees for indexing

*Deepak Bote

COMP9519 Multimedia Systems – Lecture 11 – Slide 49 – J Chen

QBIC – Search by color

** Images courtesy : Yong Rao

COMP9519 Multimedia Systems – Lecture 11 – Slide 50 – J Chen

QBIC – Search by shape

** Images courtesy : Yong Rao

COMP9519 Multimedia Systems – Lecture 11 – Slide 51 – J Chen

QBIC – Query by sketch

** Images courtesy : Yong Rao

COMP9519 Multimedia Systems – Lecture 11 – Slide 52 – J Chen

Architecture of QBIC Query System

COMP9519 Multimedia Systems – Lecture 11 – Slide 53 – J Chen

QBIC fast search and indexing

Pre-filtering
  • A computationally fast filter is applied to all data
      • Lower bounding the histogram color distance via computing the distance based on average colors
  • Only items that pass through the filter are operated on by the second stage (which computes the true similarity metric)

Indexing
  • PCA and R*-tree
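The filter-and-refine idea can be sketched as follows. This is not QBIC's actual computation: QBIC lower-bounds a quadratic-form color histogram distance, whereas this sketch swaps in the L1 distance between normalized gray-level histograms for simplicity. With bin centres in [0, 1], the difference of average levels never exceeds the L1 histogram distance, so the cheap filter cannot discard a true match. All names and data are hypothetical.

```python
from typing import List, Tuple

Hist = List[float]  # normalized gray-level histogram: bins sum to 1


def l1(h1: Hist, h2: Hist) -> float:
    """'True' (more expensive) distance used in the second stage."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def mean_level(h: Hist) -> float:
    """Average gray level in [0, 1], computed from the bin centres."""
    n = len(h)
    return sum((i + 0.5) / n * v for i, v in enumerate(h))


def filtered_range_query(db: List[Hist], q: Hist, r: float) -> Tuple[List[int], int]:
    """Return (indices of items within distance r of q, number of full comparisons)."""
    mq = mean_level(q)
    means = [mean_level(h) for h in db]  # would be precomputed and indexed in practice
    # Stage 1: cheap lower bound; anything farther than r in mean level cannot match.
    candidates = [i for i, m in enumerate(means) if abs(m - mq) <= r]
    # Stage 2: full distance only on the survivors.
    return [i for i in candidates if l1(db[i], q) <= r], len(candidates)


db = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
]
query = [1.0, 0.0, 0.0, 0.0]
matches, full_comparisons = filtered_range_query(db, query, r=0.3)
```

On this toy database the filter passes three of the five items, and the second stage keeps two of them, the same answer a full scan would give.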

COMP9519 Multimedia Systems – Lecture 11 – Slide 54 – J Chen

QBIC demo

http://www.hermitagemuseum.org/fcgi-bin/db2www/qbicSearch.mac/qbic?selLang=English

COMP9519 Multimedia Systems – Lecture 11 – Slide 55 – J Chen

Some applications of content based image and video retrieval

WWW search

Digital library

Stock photography for remote printing

Search for stock images and clip art then print out

In the textile industry

Store and retrieve textile design digitally

Corporate videos, training and educational videos

At home

COMP9519 Multimedia Systems – Lecture 11 – Slide 56 – J Chen

Future directions – Multimedia

R. Jain (Georgia Institute of Technology): “Multimedia research is not keeping pace with the technology. In research, multimedia has to become multimedia.”

“Email will become more and more audio and video. Text mail will remain important, but audio and video mail will become popular and someday overtake text in its utility. Storage, retrieval and presentation of true multimedia will emerge. Current thinking about audio and video being an appendix to text will not take us too far.”

“Multimedia research must address how the semantics of the situation emerges out of individual data sources independent of the medium represented by the source. This issue of semantics is critical to all the applications of multimedia.”

COMP9519 Multimedia Systems – Lecture 11 – Slide 57 – J Chen

Future directions – Multimedia

D. Gibbon (AT&T Research Labs):

“(…) We must now focus on less structured media.

Media processing algorithms such as speech recognition and scene change detection have been proven to work well enough on clean, structured data, but are known to have poorer performance in the presence of noise or lack of structure.

Professionally produced media amounts to only the tip of the iceberg – if we include video teleconferencing, voice communications, video surveillance, etc. then the bulk of media has very little structure that is useful for indexing in a database system. This latter class of media represents a potentially rich application space.”

COMP9519 Multimedia Systems – Lecture 11 – Slide 58 – J Chen

Future directions – Multimedia

T.S. Huang (University of Illinois at Urbana-Champaign):

“Current real-world useful video databases, where the retrieval modes are flexible, and based on a combination of keywords and visual and audio contents, are nonexistent. We hope that such flexible systems will in the future find applications in: sport events, broadcasting news, documentaries, education and training, home videos, and above all biomedicine.”

COMP9519 Multimedia Systems – Lecture 11 – Slide 59 – J Chen

Outline

Structure of video
  • Frame, shot, scene, story
  • Why we need video structure analysis

Shot segmentation

Scene segmentation

Video Browser

Introduction to query and indexing

Multimedia retrieval applications

Future research

COMP9519 Multimedia Systems – Lecture 11 – Slide 60 – J Chen

References

  • Chapter 4 of the book Multimedia Information Retrieval and Management
  • John Boreczky, Andreas Girgensohn, Gene Golovchinsky, and Shingo Uchihashi. An Interactive Comic Book Presentation for Exploring Video. In CHI 2000 Conference Proceedings, ACM Press, pp. 185-192, April 1, 2000.
  • Christel, M., Smith, M., Taylor, C.R., and Winkler, D. Evolving Video Skims into Useful Multimedia Abstractions. In Proc. ACM CHI ’98 (Los Angeles, CA, April 1998), ACM Press, 171-178.
  • Christel, M., Winkler, D., and Taylor, C.R. Improving Access to a Digital Video Library. In Human-Computer Interaction: INTERACT97, Chapman & Hall, London, 1997, 524-531.

COMP9519 Multimedia Systems – Lecture 11 – Slide 61 – J Chen

References

  • Christian Böhm, Stefan Berchtold, Daniel Keim: Searching in High-dimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases, ACM Computing Surveys 33 (3), 2001.
  • Slides by R.F. Moeller
  • Slides by Deepak Bote
  • Slides by Ahmet Senturk
  • Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Conference 1984: 47-57.
  • Guojun Lu, Techniques and Data Structures for Efficient Multimedia Retrieval based on Similarity, IEEE Transactions on Multimedia, Vol 4, No 3, September 2002.
  • Anne H.H. Ngu, Q.Z. Sheng, D.Q. Huynh, and R. Lei. Combining multi-visual features for efficient indexing in a large image database. The VLDB Journal, 9(4):280-293, May 2001.
  • Jing Chen, project report, NICTA
  • C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, W. Equitz, Efficient and effective querying by image content, Journal of Intelligent Information Systems, v.3 n.3-4, p.231-262, July 1994.
  • M. Flickner, et al., "Query by Image and Video Content: The QBIC System," IEEE Computer, Vol. 28, No. 9, pp. 23-32, 1995.
  • Slides by Oge Marques