Lecture 11: Video Structure, Representation, and Image Retrieval Systems
Dr Jing Chen, NICTA & CSE UNSW
COMP9519 Multimedia Systems, S2 2008
COMP9519 Multimedia Systems – Lecture 11 – Slide 2 – J Chen
Last week…
- Image Sampling: purpose of sampling; aliasing; smoothing before sampling
- Linear Image Filters & Convolution: kernel functions; convolution; box (average) filter; Gaussian kernel; smoothing with a Gaussian kernel
- Edge Detection: what causes an edge; using the gradient; smoothing before computing the gradient
COMP9519 Multimedia Systems – Lecture 11 – Slide 3 – J Chen
Outline
- Video Structure and Representation: frame, shot, scene, story; shot segmentation; scene segmentation
- Image Retrieval Systems: query types; video indexing; multimedia retrieval applications
- Future research
COMP9519 Multimedia Systems – Lecture 11 – Slide 4 – J Chen
Video Structure
Video is composed of “words” (structural elements) and a “language” (composition rules). Four structural levels:
- Frame: the set of all picture elements that represent one complete image
- Shot: a series of interrelated consecutive frames taken continuously by a single camera, representing a continuous action in time and space
- Scene: a sequence of shots with a common object of interest, a common location, or a common thematic concept; individual shots are joined into a scene through transitions
- Story: a sequence of scenes with a coherent focus which contains at least two independent declarative clauses. The distinction between scene and story is somewhat blurred; the two terms are used interchangeably in the remainder of this lecture
COMP9519 Multimedia Systems – Lecture 11 – Slide 5 – J Chen
A diagram of video structure
[Figure: video data is segmented into shots (shot #1, shot #2, …, shot #21); shots are grouped into scenes/stories (scene #1, …, scene #8); each shot is represented by a keyframe.]
* H.B. Kang, Video Abstraction Techniques for a Digital Library
COMP9519 Multimedia Systems – Lecture 11 – Slide 6 – J Chen
Another possible structure
[Figure: a story contains scenes (scene 1, scene 2); scenes contain shots (shot 1 … shot 5); shots contain frames 1, 2, …, N.]
COMP9519 Multimedia Systems – Lecture 11 – Slide 7 – J Chen
Why develop a video structure?
Some applications: indexing and browsing; non-linear editing; event detection.
[Figure: typical video retrieval system — video data → segmentation → key frame computation → feature extraction (color, motion, shape, …) → video query, retrieval and production / video browsing.]
* Yan Liu & Fei Li
COMP9519 Multimedia Systems – Lecture 11 – Slide 8 – J Chen
Outline
- Structure of video: frame, shot, scene, story; why we need video structure analysis
- Shot segmentation
- Scene segmentation
COMP9519 Multimedia Systems – Lecture 11 – Slide 9 – J Chen
Shot segmentation
- The shot is the most fundamental semantic element in video
- A physical unit: shot boundaries are determined by editing points
- Segmentation accuracy is affected by the variety of transition effects
- Intensively studied, with some mature algorithms: roughly 90-95% accuracy for cuts and about 80% for gradual transitions
COMP9519 Multimedia Systems – Lecture 11 – Slide 10 – J Chen
Transition effects between shots
Transitions move from one camera angle to another, or from one camera to another. Major types:
- Cut: an abrupt transition between two shots
- Fade: the overall brightness of the shot increases or decreases into a frame of a single color. A fade-out/fade-in is a slow decrease/increase in brightness; for example, a fade-out to black may indicate the end of a sequence
- Dissolve: one image is superimposed on the other as the frames of the first shot get dimmer and those of the second get brighter. Dissolves are frequently used to indicate a passage of time
- Wipe: one shot wipes across the frame and replaces the previous shot. A wipe can move in any direction: opening from one side to the other, starting in the center and moving out, or starting at the edge of the frame and moving in
A sketch of how fades and dissolves are formed follows below.
* http://www.siggraph.org/education/materials/HyperGraph/animation/cameras/traditional_film_camera_techniqu.htm
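A fade or dissolve can be modeled as a frame-wise linear blend I(t) = (1 − α(t))·A + α(t)·B between an outgoing frame A and an incoming frame B. A minimal numpy sketch of this model; the frame size, duration, and function names are illustrative, not from the original slides:

```python
import numpy as np

def dissolve(frame_a, frame_b, n):
    """Generate n >= 2 transition frames blending frame_a into frame_b.

    A fade-out is the special case where frame_b is a constant frame
    (e.g. all black); a fade-in is the reverse.
    """
    frames = []
    for t in range(n):
        alpha = t / (n - 1)  # ramps 0 -> 1 across the transition
        blend = ((1 - alpha) * frame_a.astype(np.float32)
                 + alpha * frame_b.astype(np.float32))
        frames.append(blend.astype(np.uint8))
    return frames

# Toy usage: dissolve a mid-gray luma frame into black over 10 frames.
a = np.full((144, 176), 128, dtype=np.uint8)  # QCIF-sized frame
b = np.zeros((144, 176), dtype=np.uint8)
transition = dissolve(a, b, 10)
```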
COMP9519 Multimedia Systems – Lecture 11 – Slide 11 – J Chen
Examples -- Cut
Demo 20041030_15k_cropped(1315, 4736)
COMP9519 Multimedia Systems – Lecture 11 – Slide 12 – J Chen
Examples -- Fade out & Fade in
* I. Koprinska and S. Carrato, “Temporal video segmentation: A survey”, Signal Processing: Image communication, Vol. 16, pp. 477-500, 2001.
COMP9519 Multimedia Systems – Lecture 11 – Slide 13 – J Chen
Examples -- Dissolve and cut
* I. Koprinska and S. Carrato, “Temporal video segmentation: A survey”, Signal Processing: Image communication, Vol. 16, pp. 477-500, 2001.
Also see 20041030_15k_cropped_50.yuv
COMP9519 Multimedia Systems – Lecture 11 – Slide 14 – J Chen
Demo 20041030_15k_cropped_3540.yuv
Examples -- Wipe
COMP9519 Multimedia Systems – Lecture 11 – Slide 15 – J Chen
Principles of shot boundary detection
Shot boundary detection is an instance of the time series segmentation problem. Two approaches:
- Detect changes in the time series
- Or, cluster/annotate samples in the time series
[Figure: a time series of frame feature values plotted against t.]
COMP9519 Multimedia Systems – Lecture 11 – Slide 16 – J Chen
Challenges of shot boundary detection
Desired: accurate detection of transitions and transition types (frame-accurate location).
Challenges:
- Fast motion of objects or camera
- Fast illumination changes
- Fire, smoke, flags in the wind, sea waves, …
- Specularities, shadows, reflections from glass, water, …
- Instantaneous illumination changes due to flash photography
- Very short shots (down to single-frame “shots”)
- Very long gradual transitions
- Text overlay, graphics, animation
- Screen splits, video in video
- Video artifacts: MPEG errors, compression noise, camera noise, …
* Arnon Amir
COMP9519 Multimedia Systems – Lecture 11 – Slide 17 – J Chen
Shot/scene/story segmentation
[Table: Representation → Features → Detection]
- Representation (video, audio): pixel values; sample values; compressed data
- Features: histograms; edges; texture; motion; DCT coefficients; audio types
- Detection: thresholding; statistical; model driven
* S.F.Chang
COMP9519 Multimedia Systems – Lecture 11 – Slide 18 – J Chen
Shot boundary detection by discontinuity detection
- Use features to represent video frames
- Derive a continuity signal for the video from those features
- Find discontinuities in the sequence (== shot boundaries), usually through thresholding
COMP9519 Multimedia Systems – Lecture 11 – Slide 19 – J Chen
Shot boundary detection by frame differencing
Compute the mean absolute change of intensity I(x,y) between frames k and k+l over all frame pixels:
d(k, k+l) = (1 / (X·Y)) · Σ_{x,y} | I_k(x,y) − I_{k+l}(x,y) |
T. Kikukawa and S. Kawafuchi, “Development of an automatic summary editing system for the audio visual resources,” Trans. Inst. Electron., Inform., Commun. Eng., vol. J75-A, no. 2, pp. 204–212, 1992.
A variation: only counting the pixels that change considerably from one frame to another (above a threshold T1)
K. Otsuji, Y. Tonomura, and Y. Ohba, “Video browsing using brightness data,” in Proc. SPIE/IS&T VCIP’91, vol. 1606, 1991, pp. 980–989.
In both approaches the decision is made by comparing against a threshold. Problems with these two approaches:
- Sensitive to discontinuities due to camera or object motion, flashes, etc.
- Even shifting the whole image by one pixel can cause trouble! (See the sketch below.)
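A minimal sketch of the two measures above, assuming frames are 8-bit grayscale numpy arrays; thresholds and function names are illustrative:

```python
import numpy as np

def mean_abs_diff(frame_k, frame_kl):
    """Kikukawa & Kawafuchi: mean absolute intensity change over all pixels."""
    return float(np.mean(np.abs(frame_k.astype(np.int16)
                                - frame_kl.astype(np.int16))))

def changed_pixel_ratio(frame_k, frame_kl, t1=20):
    """Otsuji et al. variation: fraction of pixels changing by more than t1."""
    diff = np.abs(frame_k.astype(np.int16) - frame_kl.astype(np.int16))
    return float(np.mean(diff > t1))

def detect_cuts(frames, threshold=30.0):
    """Declare a cut wherever the inter-frame difference exceeds the threshold."""
    return [i for i in range(1, len(frames))
            if mean_abs_diff(frames[i - 1], frames[i]) > threshold]
```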
COMP9519 Multimedia Systems – Lecture 11 – Slide 20 – J Chen
Frame difference example
[Figure: two plots of average frame difference against time t.]
COMP9519 Multimedia Systems – Lecture 11 – Slide 21 – J Chen
A closer look
Frame t, frame t+1, and their difference image (average pixel value difference: 3.2)
Frame t, frame t spatially shifted by 1 pixel, and their difference image (average pixel value difference: 19.0)
COMP9519 Multimedia Systems – Lecture 11 – Slide 22 – J Chen
Shot boundary detection by motion compensated difference
- Divide frame k into non-overlapping blocks
- For every block in frame k, find the best motion-compensated block in frame k+l and compute the difference between these two blocks
- The final difference between frames k and k+l is a weighted sum of the normalized differences (in the range 0 to 1) between all corresponding motion-compensated blocks
- The shot boundary decision is again based on thresholding
- Problems: requires computationally complex motion estimation; the measure must still distinguish cuts from gradual transitions
A sketch follows below.
B. Shahraray, “Scene change detection and content-based sampling of video sequences,” in Proc. IS&T/SPIE, vol. 2419, Feb. 1995, pp. 2–13.
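A minimal sketch of the block-matching flavour of this idea, assuming grayscale numpy frames. The exhaustive search range, the block size, and the equal block weights are illustrative simplifications of the weighted sum described above:

```python
import numpy as np

def best_block_diff(block, frame_next, y, x, search=8):
    """Minimum mean-abs-difference between `block` (at (y, x) in frame k)
    and candidate blocks within +/-`search` pixels in frame k+l,
    normalized to [0, 1]."""
    h, w = block.shape
    H, W = frame_next.shape
    best = np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy <= H - h and 0 <= xx <= W - w:
                cand = frame_next[yy:yy + h, xx:xx + w]
                d = np.mean(np.abs(block.astype(np.int16)
                                   - cand.astype(np.int16)))
                best = min(best, d)
    return best / 255.0

def mc_frame_diff(frame_k, frame_kl, block=16):
    """Equal-weight average of normalized motion-compensated block
    differences; high values suggest a shot boundary."""
    H, W = frame_k.shape
    diffs = [best_block_diff(frame_k[y:y + block, x:x + block],
                             frame_kl, y, x)
             for y in range(0, H - block + 1, block)
             for x in range(0, W - block + 1, block)]
    return float(np.mean(diffs))
```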
COMP9519 Multimedia Systems – Lecture 11 – Slide 23 – J Chen
Shot boundary detection by histogram comparison
Advantages:
- Histograms are less sensitive to camera and object motion
- Histograms are invariant to image rotation and change slowly under variations of viewing angle and scale
Disadvantages:
- Histograms discard spatial information; two images with similar histograms may have completely different content
- A solution to this problem is to use joint histograms over several features: color, edge density, texture, etc.
Greg Pass, Ramin Zabih: Comparing Images Using Joint Histograms. Multimedia Syst. 7(3): 234-240 (1999).
COMP9519 Multimedia Systems – Lecture 11 – Slide 24 – J Chen
Global histogram comparison
- A cut is declared if the absolute sum of histogram differences between two successive frames exceeds a threshold
- Various color spaces and histogram similarity metrics have been tried by different authors
- The twin-comparison algorithm uses the difference between consecutive frames to detect cuts, and the accumulated difference over a sequence of frames to detect gradual transitions (see the sketch below)
H.J. Zhang, A. Kankanhalli, S.W. Smoliar, Automatic partitioning of full-motion video, Multimedia Systems 1(1) (1993) 10-28.
* Irena Koprinska, Sergio Carrato
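A minimal sketch of global-histogram cut detection together with the twin-comparison idea, assuming grayscale numpy frames; the two thresholds t_high and t_low are illustrative:

```python
import numpy as np

def hist_diff(f1, f2, bins=64):
    """Absolute sum of (grayscale) histogram differences, per pixel."""
    h1, _ = np.histogram(f1, bins=bins, range=(0, 256))
    h2, _ = np.histogram(f2, bins=bins, range=(0, 256))
    return float(np.sum(np.abs(h1 - h2))) / f1.size

def twin_comparison(frames, t_high=0.5, t_low=0.1):
    """Cuts: a single difference above t_high.  Gradual transitions: a run
    of differences above t_low whose accumulated sum exceeds t_high."""
    cuts, graduals = [], []
    acc, start = 0.0, None
    for i in range(1, len(frames)):
        d = hist_diff(frames[i - 1], frames[i])
        if d > t_high:
            cuts.append(i)
            acc, start = 0.0, None
        elif d > t_low:  # possible start/continuation of a gradual transition
            if start is None:
                start = i
            acc += d
        else:
            if start is not None and acc > t_high:
                graduals.append((start, i - 1))
            acc, start = 0.0, None
    return cuts, graduals
```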
COMP9519 Multimedia Systems – Lecture 11 – Slide 25 – J Chen
Local histogram comparison
Combines histogram-based and block-based comparisons to take spatial information into account.
A. Nagasaka and Y. Tanaka, “Automatic video indexing and full-video search for object appearances,” in Visual Database Systems II, E. Knuth and L. M.Wegner, Eds. Amsterdam, The Netherlands: North-Holland, 1992, pp. 113–127.
- Frames k and k+l are each divided into 16 blocks, and histograms are computed for all blocks
- The χ²-test is used to compare corresponding block histograms
- When computing the discontinuity as a sum of block-histogram differences, the eight largest differences are discarded to reduce the influence of motion and noise
- Drawback: this can make the measure less sensitive to some shot boundaries, since some boundaries involve only partial color histogram changes in the frame
A sketch follows below.
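A minimal sketch of this scheme (a 4×4 grid of blocks, per-block χ² comparison, eight largest block differences discarded); grayscale numpy frames are assumed and the bin count is illustrative:

```python
import numpy as np

def chi2(h1, h2):
    """Chi-square distance between two histograms."""
    denom = h1 + h2
    mask = denom > 0
    return float(np.sum((h1[mask] - h2[mask]) ** 2 / denom[mask]))

def local_hist_diff(f1, f2, grid=4, bins=64, discard=8):
    """Sum of per-block chi-square differences over a grid x grid split,
    discarding the `discard` largest block differences to suppress the
    influence of object motion and noise."""
    H, W = f1.shape
    bh, bw = H // grid, W // grid
    diffs = []
    for by in range(grid):
        for bx in range(grid):
            b1 = f1[by * bh:(by + 1) * bh, bx * bw:(bx + 1) * bw]
            b2 = f2[by * bh:(by + 1) * bh, bx * bw:(bx + 1) * bw]
            h1, _ = np.histogram(b1, bins=bins, range=(0, 256))
            h2, _ = np.histogram(b2, bins=bins, range=(0, 256))
            diffs.append(chi2(h1.astype(np.float64), h2.astype(np.float64)))
    return float(sum(sorted(diffs)[:len(diffs) - discard]))
```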
COMP9519 Multimedia Systems – Lecture 11 – Slide 26 – J Chen
Overview of some shot segmentation methods
Frame differencing
Motion compensated frame difference
Global histogram
Joint histograms
Local histogram
COMP9519 Multimedia Systems – Lecture 11 – Slide 27 – J Chen
Thresholding vs. clustering-based shot segmentation
Thresholding:
- Local decision (based on information from very few frames)
- Thresholds are typically highly sensitive to the type of input video
Clustering:
- Views shot segmentation as a k-class unsupervised clustering problem; frames are assigned to one of the k classes via k-means
- Global decision
- Eliminates the need for threshold setting, and also allows multiple features to be used simultaneously to improve performance
COMP9519 Multimedia Systems – Lecture 11 – Slide 28 – J Chen
Clustering-based shot segmentation
- Compute the color-content dissimilarity of each consecutive frame pair using a histogram-based comparison metric: the χ² test, or the histogram difference (L1 distance)
- Based on these metric values, classify frames into two classes (shot boundaries and non-boundaries) using the k-means algorithm; label all frames in the cluster with the largest mean as boundaries (see the sketch below)
B. Günsel, A.M. Ferman, A.M. Tekalp, "Temporal video segmentation using unsupervised clustering and semantic object tracking," Journal of Electronic Imaging, 7(3), (1998) 592-604.
A.M. Ferman, A.M. Tekalp, "Efficient filtering and clustering for temporal video segmentation and visual summarization," Journal of Visual Communication and Image Representation 9(4) (1998) 336-351.
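A minimal sketch of the two-class k-means step, using the histogram L1 distance as the dissimilarity metric and a hand-rolled 1-D k-means to stay dependency-free; all names are illustrative:

```python
import numpy as np

def hist_diff(f1, f2, bins=64):
    """Histogram L1 distance between two grayscale frames, per pixel."""
    h1, _ = np.histogram(f1, bins=bins, range=(0, 256))
    h2, _ = np.histogram(f2, bins=bins, range=(0, 256))
    return float(np.sum(np.abs(h1 - h2))) / f1.size

def kmeans_1d(values, k=2, iters=50):
    """Tiny 1-D k-means: returns labels and cluster means."""
    v = np.asarray(values, dtype=np.float64)
    centers = np.linspace(v.min(), v.max(), k)
    for _ in range(iters):
        labels = np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = v[labels == j].mean()
    return labels, centers

def cluster_boundaries(frames):
    """Label frames in the higher-mean cluster of consecutive-frame
    dissimilarities as shot boundaries; no threshold is needed."""
    d = [hist_diff(frames[i - 1], frames[i]) for i in range(1, len(frames))]
    labels, centers = kmeans_1d(d, k=2)
    boundary = int(np.argmax(centers))
    return [i + 1 for i, lab in enumerate(labels) if lab == boundary]
```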
COMP9519 Multimedia Systems – Lecture 11 – Slide 29 – J Chen
Summary of video shot segmentation methods
- Pixel domain approaches: frame differencing; motion-compensated frame differencing; histograms (global, joint, and local); model driven
- Threshold-based vs. clustering-based decisions
COMP9519 Multimedia Systems – Lecture 11 – Slide 30 – J Chen
Scene segmentation
- Scene: a series of consecutive shots with a common focus
- Scene/story segmentation is usually built on shot segmentation: shot segmentation + shot clustering/grouping
COMP9519 Multimedia Systems – Lecture 11 – Slide 31 – J Chen
Representing video shots using representative frames (R-frames)
The need to represent video with fewer frames: a 90-minute movie at 24 frames per second has 90 × 60 × 24 = 129,600 frames!
R-frames:
- R-frames are selected frames within a video shot
- R-frame R_A (== original frame 1) represents frames 1 to i, where the distance between each of these i frames and R_A is <= a threshold ε
- R_B is the first frame following R_A whose distance from R_A is > ε, and so on
[Diagram: frames 1, 2, …, i, i+1, …, j, j+1, …, N are covered by R-frames R_A, R_B, R_C, …]
A sketch follows below.
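A minimal sketch of sequential R-frame selection under a threshold ε, using a histogram L1 distance as the inter-frame distance D; the choice of D and the value of ε are illustrative:

```python
import numpy as np

def frame_dist(f1, f2, bins=64):
    """Inter-frame distance D; here a per-pixel histogram L1 distance."""
    h1, _ = np.histogram(f1, bins=bins, range=(0, 256))
    h2, _ = np.histogram(f2, bins=bins, range=(0, 256))
    return float(np.sum(np.abs(h1 - h2))) / f1.size

def extract_r_frames(shot_frames, eps=0.2):
    """Greedy selection: the current R-frame represents every following
    frame within distance eps of it; the first frame whose distance
    exceeds eps becomes the next R-frame (R_B, R_C, ...)."""
    r_indices = [0]  # R_A is the first frame of the shot
    for i in range(1, len(shot_frames)):
        if frame_dist(shot_frames[r_indices[-1]], shot_frames[i]) > eps:
            r_indices.append(i)
    return r_indices
```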
COMP9519 Multimedia Systems – Lecture 11 – Slide 32 – J Chen
Distance between two shots
Given two shots S_i and S_j, each represented by a set of R-frames, the distance between the two shots is
d(S_i, S_j) ≈ min_{l,m} D(f_l, f_m)
where f_l and f_m enumerate all R-frames of shots S_i and S_j, respectively.
[Diagram: R-frames for shot i: R_Ai, R_Bi, R_Ci, …; R-frames for shot j: R_Aj, R_Bj, R_Cj, …]
COMP9519 Multimedia Systems – Lecture 11 – Slide 33 – J Chen
Visual dissimilarity between two clusters
Clustering groups shots into clusters. The visual dissimilarity between two clusters is defined via the maximum shot similarity (i.e., the minimum shot distance) over all possible shot pairs:
d(C_i, C_j) = min_{S_k ∈ C_i, S_l ∈ C_j} d(S_k, S_l)
where C_i is the i-th cluster. This follows the same principle as the shot distance definition.
COMP9519 Multimedia Systems – Lecture 11 – Slide 34 – J Chen
Criteria for time-constrained clustering of video shots
- Temporal: within each cluster, no two frames may be more than T frames apart
- Visual similarity (measures intra-cluster quality): the visual dissimilarity between all shot pairs in a cluster should be <= a threshold ε
- Inter-cluster quality: the dissimilarity between clusters should be > ε
- Follows the same principle used when extracting R-frames; a sketch follows below
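A minimal sketch of greedy, time-constrained grouping of temporally ordered shots under the criteria above; the shot representation, the threshold values, and the exact form of the temporal test are assumptions, not from the original slides:

```python
import numpy as np

def shot_dist(feats_i, feats_j):
    """d(S_i, S_j) ~ minimum distance over all R-frame feature pairs
    (i.e., maximum similarity), matching the shot distance definition."""
    return min(float(np.sum(np.abs(fi - fj)))
               for fi in feats_i for fj in feats_j)

def time_constrained_grouping(shots, eps=0.2, T=1000):
    """Each shot joins an existing cluster only if (a) no two frames in the
    resulting cluster would be more than T frames apart and (b) it is
    visually within eps of every member; otherwise it starts a new
    cluster.  `shots` is a temporally ordered list of
    (r_frame_feature_vectors, first_frame, last_frame) tuples."""
    clusters = []  # each cluster is a list of shot indices
    for idx, (feats, _first, last) in enumerate(shots):
        placed = False
        for cluster in clusters:
            ok_time = all(last - shots[m][1] <= T for m in cluster)
            ok_vis = all(shot_dist(feats, shots[m][0]) <= eps
                         for m in cluster)
            if ok_time and ok_vis:
                cluster.append(idx)
                placed = True
                break
        if not placed:
            clusters.append([idx])
    return clusters
```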
COMP9519 Multimedia Systems – Lecture 11 – Slide 35 – J Chen
From video to scene transition graph (Yeung, Yeo & Liu ’98)
COMP9519 Multimedia Systems – Lecture 11 – Slide 36 – J Chen
Scene segmentation with scene transition graph (Yeung, Yeo & Liu ’98)
A scene transition graph is a compact representation of a video as a directed graph G = (V, E):
- V: vertices; each vertex represents a cluster of similar shots
- E: edges; vertices i and j are connected by an edge if some shot in cluster i precedes a shot in cluster j
Scene segmentation amounts to finding cut edges in the graph: a cut edge is an edge whose removal leaves two disconnected subgraphs. A sketch follows below.
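A minimal sketch of building a scene transition graph from per-shot cluster labels and finding cut edges by removal-and-reachability; names and the toy labels are illustrative:

```python
from collections import defaultdict

def build_stg(shot_cluster_labels):
    """Directed edge i -> j whenever a shot in cluster i immediately
    precedes a shot in cluster j."""
    edges = set()
    for a, b in zip(shot_cluster_labels, shot_cluster_labels[1:]):
        if a != b:
            edges.add((a, b))
    return set(shot_cluster_labels), edges

def is_connected(vertices, edges):
    """Connectivity of the undirected version of the graph, via DFS."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v] - seen)
    return seen == vertices

def cut_edges(vertices, edges):
    """Edges whose removal disconnects the graph; each marks a scene boundary."""
    return [e for e in edges if not is_connected(vertices, edges - {e})]

# Toy usage: shots labelled by cluster; the scene boundary is the cut
# edge between clusters 2 and 3.
labels = [1, 2, 1, 2, 3, 4, 3, 4]
V, E = build_stg(labels)
print(cut_edges(V, E))  # [(2, 3)]
```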
COMP9519 Multimedia Systems – Lecture 11 – Slide 37 – J Chen
GraphG=(V,E)
V: vertices. E: edges.
COMP9519 Multimedia Systems – Lecture 11 – Slide 38 – J Chen
A diagram of scene segmentation with scene transition graph (YYL ’98)
Note temporal constraint
COMP9519 Multimedia Systems – Lecture 11 – Slide 39 – J Chen
Scene Transition Graph
COMP9519 Multimedia Systems – Lecture 11 – Slide 40 – J Chen
Video Browser -- Key-frame Based Hierarchical Video Browser
* H.J. Zhang et al., Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution, ACM Multimedia 1995
COMP9519 Multimedia Systems – Lecture 11 – Slide 41 – J Chen
Video Browser -- Browser Layout for Key-Frame Based Hierarchical Video Browser
COMP9519 Multimedia Systems – Lecture 11 – Slide 42 – J Chen
Outline
- Structure of video: frame, shot, scene, story; why we need video structure analysis
- Shot segmentation; scene segmentation; video browsers
- Introduction to Query and Image Retrieval: multimedia retrieval applications; future research
COMP9519 Multimedia Systems – Lecture 11 – Slide 43 – J Chen
Feature-based Similarity Search
Video Query
COMP9519 Multimedia Systems – Lecture 11 – Slide 44 – J Chen
Query types
- Point query: specifies a point Q in the data space and retrieves all point objects P in the database with identical coordinates: {P | P = Q}
- Range query: given a query point Q, a distance r, and a distance function M, retrieve all points P in the database whose distance from Q under M is at most r: {P | M(P, Q) <= r}
- Nearest-neighbor query: given a query point Q, retrieve the point P in the database that minimizes M(P, Q)
- k-nearest-neighbor query: given a query point Q, return the k points with the smallest distances M(P, Q)
COMP9519 Multimedia Systems – Lecture 11 – Slide 45 – J Chen
Distance functions (a sketch follows below):
- Euclidean (L2): d(x, y) = ( Σ_i (x_i − y_i)² )^(1/2)
- Manhattan (L1): d(x, y) = Σ_i |x_i − y_i|
- Maximum (L∞): d(x, y) = max_i |x_i − y_i|
- Weighted Euclidean: d(x, y) = ( Σ_i w_i (x_i − y_i)² )^(1/2)
- Weighted maximum: d(x, y) = max_i w_i |x_i − y_i|
- Ellipsoid: d(x, y) = ( (x − y)^T W (x − y) )^(1/2), where W is a positive definite similarity matrix
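A minimal numpy sketch of these distance functions plus a brute-force k-nearest-neighbor query; function names and the toy data are illustrative:

```python
import numpy as np

def l2(x, y):
    return float(np.sqrt(np.sum((x - y) ** 2)))      # Euclidean

def l1(x, y):
    return float(np.sum(np.abs(x - y)))              # Manhattan

def linf(x, y):
    return float(np.max(np.abs(x - y)))              # Maximum

def weighted_l2(x, y, w):
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))  # Weighted Euclidean

def ellipsoid(x, y, W):
    d = x - y                                        # W: positive definite
    return float(np.sqrt(d @ W @ d))

def knn(query, database, k=5, dist=l2):
    """Brute-force k-nearest-neighbor query: indices of the k database
    vectors closest to `query` under `dist`."""
    dists = np.array([dist(query, p) for p in database])
    return np.argsort(dists)[:k]

# Toy usage with random 8-D feature vectors.
rng = np.random.default_rng(0)
db = rng.random((100, 8))
q = rng.random(8)
print(knn(q, db, k=3))
```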
COMP9519 Multimedia Systems – Lecture 11 – Slide 46 – J Chen
Diagram of Image Retrieval System
[Figure: the user's query specification passes through feature extraction to produce feature vectors; images in the image database likewise pass through feature extraction into stored feature vectors; similarity comparison together with indexing & retrieval produces the retrieval results (output), which the user can refine through relevance feedback.]
COMP9519 Multimedia Systems – Lecture 11 – Slide 47 – J Chen
Image Retrieval: Client/Server mode
[Figure: the client sends a query to a search engine backed by an MPEG-7 database; the query response is a list of matching content (clip1.mp4, clip2.mp4, …); the client then requests content from a media server via RTSP, and the media is streamed back via RTP/RTCP.]
COMP9519 Multimedia Systems – Lecture 11 – Slide 48 – J Chen
Application: IBM’s QBIC
QBIC – Query by Image Content
- The first commercial CBIR system; a model system that influenced many others
- Uses color, texture, and shape features; text-based search can also be combined
- Uses R*-trees for indexing
*Deepak Bote
COMP9519 Multimedia Systems – Lecture 11 – Slide 49 – J Chen
QBIC – Search by color
** Images courtesy : Yong Rao
COMP9519 Multimedia Systems – Lecture 11 – Slide 50 – J Chen
QBIC – Search by shape
** Images courtesy : Yong Rao
COMP9519 Multimedia Systems – Lecture 11 – Slide 51 – J Chen
QBIC – Query by sketch
** Images courtesy : Yong Rao
COMP9519 Multimedia Systems – Lecture 11 – Slide 52 – J Chen
Architecture of QBIC Query System
COMP9519 Multimedia Systems – Lecture 11 – Slide 53 – J Chen
QBIC fast search and indexing
Pre-filtering:
- A computationally fast filter is applied to all data, lower-bounding the histogram color distance by computing a distance based on average colors
- Only items that pass the filter are processed by the second stage, which computes the true similarity metric (see the sketch below)
Indexing: PCA and R*-trees
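A minimal sketch of the two-stage idea: a cheap average-color distance filters the database, and the expensive histogram distance is computed only on the survivors. The `bound` factor stands in for the lower-bounding constant relating the two distances (its correct value depends on the actual metrics, so treat it as an assumption):

```python
import numpy as np

def avg_color(image):
    """Mean RGB of an H x W x 3 image: the cheap 3-D filter feature."""
    return image.reshape(-1, 3).mean(axis=0)

def hist_distance(h1, h2):
    """The expensive, 'true' distance (plain L2 here for simplicity)."""
    return float(np.linalg.norm(h1 - h2))

def two_stage_query(q_avg, q_hist, db_avgs, db_hists, r, bound=1.0):
    """Stage 1: keep items whose average-color distance could still fall
    within range r, assuming the cheap distance (scaled by `bound`)
    lower-bounds the true one, so rejected items are safe to skip.
    Stage 2: exact distance on the survivors only."""
    survivors = [i for i, a in enumerate(db_avgs)
                 if np.linalg.norm(q_avg - a) <= bound * r]
    return [i for i in survivors
            if hist_distance(q_hist, db_hists[i]) <= r]
```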
COMP9519 Multimedia Systems – Lecture 11 – Slide 54 – J Chen
QBIC demo: http://www.hermitagemuseum.org/fcgi-bin/db2www/qbicSearch.mac/qbic?selLang=English
COMP9519 Multimedia Systems – Lecture 11 – Slide 55 – J Chen
Some applications of content-based image and video retrieval
- WWW search
- Digital libraries
- Stock photography for remote printing: search for stock images and clip art, then print them out
- The textile industry: store and retrieve textile designs digitally
- Corporate videos, training and educational videos
- At home
COMP9519 Multimedia Systems – Lecture 11 – Slide 56 – J Chen
Future directions – Multimedia
R. Jain (Georgia Institute of Technology): “Multimedia research is not keeping pace with the technology. In research, multimedia has to become multimedia.”
“Email will become more and more audio and video. Text mail will remain important, but audio and video mail will become popular and someday overtake text in its utility. Storage, retrieval and presentation of true multimedia will emerge. Current thinking about audio and video being an appendix to text will not take us too far.”
“Multimedia research must address how the semantics of the situation emerges out of individual data sources independent of the medium represented by the source. This issue of semantics is critical to all the applications of multimedia.”
COMP9519 Multimedia Systems – Lecture 11 – Slide 57 – J Chen
Future directions – Multimedia
D. Gibbon (AT&T Research Labs):
“(…) We must now focus on less structured media.
Media processing algorithms such as speech recognition and scene change detection have been proven to work well enough on clean, structured data, but are known to have poorer performance in the presence of noise or lack of structure.
Professionally produced media amounts to only the tip of the iceberg: if we include video teleconferencing, voice communications, video surveillance, etc., then the bulk of media has very little structure that is useful for indexing in a database system. This latter class of media represents a potentially rich application space.”
COMP9519 Multimedia Systems – Lecture 11 – Slide 58 – J Chen
Future directions – Multimedia
T.S. Huang (University of Illinois at Urbana-Champaign):
“Current real-world useful video databases, where the retrieval modes are flexible, and based on a combination of keywords and visual and audio contents, are nonexistent. We hope that such flexible systems will in the future find applications in: sport events, broadcasting news, documentaries, education and training, home videos, and above all biomedicine.”
COMP9519 Multimedia Systems – Lecture 11 – Slide 59 – J Chen
Outline
- Structure of video: frame, shot, scene, story; why we need video structure analysis
- Shot segmentation; scene segmentation; video browsers
- Introduction to query and indexing; multimedia retrieval applications; future research
COMP9519 Multimedia Systems – Lecture 11 – Slide 60 – J Chen
References
- Chapter 4 of the book Multimedia Information Retrieval and Management.
- J. Boreczky, A. Girgensohn, G. Golovchinsky, S. Uchihashi, "An Interactive Comic Book Presentation for Exploring Video," CHI 2000 Conference Proceedings, ACM Press, pp. 185-192, 2000.
- M. Christel, M. Smith, C.R. Taylor, D. Winkler, "Evolving Video Skims into Useful Multimedia Abstractions," Proc. ACM CHI '98 (Los Angeles, CA, April 1998), ACM Press, pp. 171-178.
- M. Christel, D. Winkler, C.R. Taylor, "Improving Access to a Digital Video Library," Human-Computer Interaction: INTERACT '97, Chapman & Hall, London, 1997, pp. 524-531.
COMP9519 Multimedia Systems – Lecture 11 – Slide 61 – J Chen
References
- C. Böhm, S. Berchtold, D. Keim, "Searching in High-Dimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases," ACM Computing Surveys 33(3), 2001.
- Slides by R.F. Moeller, Deepak Bote, Ahmet Senturk, and Oge Marques.
- A. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching," Proc. SIGMOD Conference 1984, pp. 47-57.
- G. Lu, "Techniques and Data Structures for Efficient Multimedia Retrieval Based on Similarity," IEEE Transactions on Multimedia, 4(3), September 2002.
- A.H.H. Ngu, Q.Z. Sheng, D.Q. Huynh, R. Lei, "Combining Multi-Visual Features for Efficient Indexing in a Large Image Database," The VLDB Journal, 9(4):280-293, May 2001.
- Jing Chen, project report, NICTA.
- C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, W. Equitz, "Efficient and Effective Querying by Image Content," Journal of Intelligent Information Systems, 3(3-4):231-262, July 1994.
- M. Flickner et al., "Query by Image and Video Content: The QBIC System," IEEE Computer, 28(9):23-32, 1995.