Lecture 9: Multimedia Content Description (2)
Dr Jing Chen, NICTA & CSE UNSW
COMP9519 Multimedia Systems, S2 2006
COMP9519 Multimedia Systems – Lecture 9 – Slide 2 – J. Chen
Last week's lecture: why describe multimedia content?
- Explosion in the sources of digital media content; large collections of media items
- The problem: how to search and discover multimedia content? How to index long video and audio sequences? How to browse content more efficiently?
- Application cases
- Content description standard: MPEG-7 (definition; goal: interoperability)
COMP9519 Multimedia Systems – Lecture 9 – Slide 3 – J. Chen
Acknowledgement
Thanks to Dr. Jack Yu for providing the initial version of the lecture slides.
COMP9519 Multimedia Systems – Lecture 9 – Slide 4 – J. Chen
Outline
- Introduction
- Color features
  - Color and color spaces
  - Histograms and similarity metrics
  - Color descriptors
- Texture features
- Shape features
- Motion features (next lecture)
COMP9519 Multimedia Systems – Lecture 9 – Slide 5 – J. Chen
Example
COMP9519 Multimedia Systems – Lecture 9 – Slide 6 – J. Chen
Visual features
Why visual features?
- Manual labeling is subjective and time-consuming
- It is difficult to describe content completely by text
What visual features?
- Extractable from images/video
- Learned from the human visual system
Mathematical representation
- M pixels in R³ (color channels) → N-dimensional feature vectors
- E.g., color histogram: 640×480 pixels in R³ → a 40-d vector (bins); N << M, and the dimension N is usually small
COMP9519 Multimedia Systems – Lecture 9 – Slide 7 – J. Chen
Good visual features
- Compactness of the representation
- Discriminative power
- Invariance to occlusion, shift, rotation, lighting changes, etc.
- Low complexity
COMP9519 Multimedia Systems – Lecture 9 – Slide 8 – J. Chen
Popular visual features
- Color: color histogram, color moments, dominant color
- Texture (structural and statistical): edge histogram, Tamura features
- Shape: boundaries of objects
- Motion: camera motion and object motion
COMP9519 Multimedia Systems – Lecture 9 – Slide 9 – J. Chen
Outline
- Introduction
- Color features
  - Color and color spaces
  - Histograms and similarity metrics
  - Color descriptors
- Texture features
- Shape features
- Motion features
- Content search examples with features
COMP9519 Multimedia Systems – Lecture 9 – Slide 10 – J. Chen
Color
Ref: Gonzalez and Woods, Digital Image Processing
COMP9519 Multimedia Systems – Lecture 9 – Slide 11 – J. Chen
Primary colorsOwing to the structure of the human visual system, all colors are seen as variable combinations of the three primary colors: Red, Green and Blue
COMP9519 Multimedia Systems – Lecture 9 – Slide 12 – J. Chen
RGB color space
A color space is a 3-D coordinate system, together with a subspace within that system, where each color is represented by a single point via its coordinates. RGB is the most commonly used color space.
COMP9519 Multimedia Systems – Lecture 9 – Slide 13 – J. Chen
HSV colour space
- Hue (H) represents the dominant spectral component: color in its pure form, as in green, red, or yellow
- Saturation (S) refers to the relative purity, or the amount of white light mixed with a hue
- Value (V) corresponds to the brightness of the color
Why the HSV colour space?
- Perceptually uniform: geometric distance is consistent with perceptual distance
- More natural to humans: more meaningful, easier to work with
HSV color space visualized as a cylindrical object
COMP9519 Multimedia Systems – Lecture 9 – Slide 14 – J. Chen
RGB to HSV conversion
    V = max(r, g, b)
    S = ( max(r, g, b) - min(r, g, b) ) / max(r, g, b)
    H depends on which of r, g, b is the maximum
// Theoretically, hue 0 (pure red) is identical to hue 6 in these transforms.
// Pure red always maps to 6 in this implementation, so UNDEFINED can be
// defined as 0 in situations where only unsigned numbers are desired.
#define RETURN_HSV(h, s, v) { HSV.H = h; HSV.S = s; HSV.V = v; return HSV; }

typedef struct { float R, G, B; } RGBType;
typedef struct { float H, S, V; } HSVType;

// R, G and B are each on [0, 1]. S and V are returned on [0, 1] and H is
// returned on [0, 6]; exception: H is returned UNDEFINED if S == 0.
HSVType RGB_to_HSV(RGBType RGB)
{
    float R = RGB.R, G = RGB.G, B = RGB.B, v, x, f;
    int i;
    HSVType HSV;
    x = min(R, G, B);                 /* smallest component */
    v = max(R, G, B);                 /* largest component = V */
    if (v == x)
        RETURN_HSV(UNDEFINED, 0, v);  /* achromatic: S = 0, H undefined */
    f = (R == x) ? G - B : ((G == x) ? B - R : R - G);
    i = (R == x) ? 3 : ((G == x) ? 5 : 1);
    RETURN_HSV(i - f / (v - x), (v - x) / v, v);
}
COMP9519 Multimedia Systems – Lecture 9 – Slide 15 – J. Chen
Distance between two color points in the HSV color space
Given two colors (h1, s1, v1) and (h2, s2, v2), where h, s and v are in the range [0, 1], the distance between these two colors is:
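A minimal sketch of one way such a distance is commonly computed: embed both colors in the HSV cone, so that hue differences matter less for dark, unsaturated colors. The cone embedding is an assumed form, not necessarily the slide's exact formula.

```python
import math

def hsv_cone_distance(c1, c2):
    """Distance between two HSV colors (h, s, v), each component in [0, 1].

    Embeds each color in a cone: hue is an angle, saturation (scaled by
    value) the radius, and value the height. Assumed form, one common choice.
    """
    def to_xyz(h, s, v):
        angle = 2.0 * math.pi * h
        return (s * v * math.cos(angle), s * v * math.sin(angle), v)

    x1, y1, z1 = to_xyz(*c1)
    x2, y2, z2 = to_xyz(*c2)
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)
```

With this embedding, pure red (0, 1, 1) and pure cyan (0.5, 1, 1) sit at opposite ends of the top disc, distance 2.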
COMP9519 Multimedia Systems – Lecture 9 – Slide 16 – J. Chen
HMMD color space: five parameters
- Hue: the same as defined for HSV
- Max and Min: the maximum and minimum among the R, G and B values (blackness and whiteness)
- Diff = Max - Min (colorfulness)
- Sum = Max + Min (brightness)
Three parameters, Hue, Max and Min (or Hue, Diff and Sum), are enough to describe the color space.
Adopted in MPEG-7; used in the color structure descriptor (CSD).
Advantage: close to perceptually uniform.
[Figure: HMMD double-cone diagram, with White at the Max apex, Black at the Min apex, Sum along the vertical axis, Diff along the radial axis, and Hue as the angle]
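The five HMMD parameters can be computed directly from RGB. A minimal sketch, with hue taken from the standard HSV conversion and no rescaling of Diff and Sum:

```python
import colorsys

def rgb_to_hmmd(r, g, b):
    """Convert RGB (each in [0, 1]) to HMMD (hue, max, min, diff, sum).

    Hue is the same as in HSV (here in [0, 1)); Max/Min are the extreme
    RGB components; Diff = Max - Min (colorfulness); Sum = Max + Min
    (brightness). No rescaling of Diff/Sum is applied in this sketch.
    """
    mx, mn = max(r, g, b), min(r, g, b)
    h, _, _ = colorsys.rgb_to_hsv(r, g, b)   # hue as defined for HSV
    return h, mx, mn, mx - mn, mx + mn
```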
COMP9519 Multimedia Systems – Lecture 9 – Slide 17 – J. Chen
Distance between two color points in HMMD
Suppose h is from 0 to 2π, min is from 0 to 1, max is from 0 to 1, diff is from 0 to sqrt(2)/2 and sum is from 0 to sqrt(2), where

    diff = (max - min) / sqrt(2),   sum = (max + min) / sqrt(2)

The distance between c1 and c2 (the range of the distance value is 0 to 1) is

    dist(c1, c2) = sqrt( d_s² + d_d² ) / 2

where

    d_s = |sum_1 - sum_2|

and

    d_d = sqrt( diff_1² + diff_2² - 2 · diff_1 · diff_2 · cos(h_1 - h_2) )
COMP9519 Multimedia Systems – Lecture 9 – Slide 18 – J. Chen
YCbCr color space
ITU-R BT.601 defines Y as the brightness (luma), Cb as blue minus luma (B - Y), and Cr as red minus luma (R - Y). Y has a nominal range of 16-235 (black to white); Cb and Cr have a nominal range of 16-240, with 128 corresponding to zero. YCbCr is defined to have been derived from gamma pre-corrected component RGB signals:

    Y  = (77/256)R + (150/256)G + (29/256)B
    Cb = -(44/256)R - (87/256)G + (131/256)B + 128
    Cr = (131/256)R - (110/256)G - (21/256)B + 128

    R = Y + 1.371(Cr - 128)
    G = Y - 0.698(Cr - 128) - 0.336(Cb - 128)
    B = Y + 1.732(Cb - 128)
COMP9519 Multimedia Systems – Lecture 9 – Slide 19 – J. Chen
YCbCr color space (continued)
If the 24-bit RGB data have a range of 0-255 (black to white), as commonly found in PCs, the following equations should be used to maintain the correct black and white levels:

    Y  = 0.257R + 0.504G + 0.098B + 16
    Cb = -0.148R - 0.291G + 0.439B + 128
    Cr = 0.439R - 0.368G - 0.071B + 128

    R = 1.164(Y - 16) + 1.596(Cr - 128)
    G = 1.164(Y - 16) - 0.813(Cr - 128) - 0.392(Cb - 128)
    B = 1.164(Y - 16) + 2.017(Cb - 128)

YCbCr represents color as brightness plus two color-difference signals, while RGB represents color as red, green and blue.
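The PC-range (0-255) equations can be checked with a small round-trip sketch:

```python
def rgb_to_ycbcr(r, g, b):
    """0-255 RGB to YCbCr, using the PC-range (0-255 black-to-white) equations."""
    y  =  0.257 * r + 0.504 * g + 0.098 * b + 16
    cb = -0.148 * r - 0.291 * g + 0.439 * b + 128
    cr =  0.439 * r - 0.368 * g - 0.071 * b + 128
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    """Inverse transform, using the matching PC-range equations."""
    r = 1.164 * (y - 16) + 1.596 * (cr - 128)
    g = 1.164 * (y - 16) - 0.813 * (cr - 128) - 0.392 * (cb - 128)
    b = 1.164 * (y - 16) + 2.017 * (cb - 128)
    return r, g, b
```

Because the published coefficients are rounded, a forward-then-inverse conversion recovers RGB only to within a small error, not exactly.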
COMP9519 Multimedia Systems – Lecture 9 – Slide 20 – J. Chen
Outline
- Introduction
- Color features
  - Color and color spaces
  - Histograms and similarity metrics
  - Color descriptors
- Texture features
- Shape features
COMP9519 Multimedia Systems – Lecture 9 – Slide 21 – J. Chen
Histograms
- A one-dimensional data distribution = a set of (bin, frequency) pairs
- Each bin has its associated attribute value
- The set of bins partitions the feature space (here, the gray-scale values)
* Nuno Vasconcelos and Andrew Lippman
COMP9519 Multimedia Systems – Lecture 9 – Slide 22 – J. Chen
Color histograms
- Partition the feature space into several bins
- Record the number of pixels falling into each bin
- Example: partition of the RGB color space into 8 bins
Three types of histograms, depending on how we partition the color space:
- Fixed binning
- Clustered binning
- Adaptive binning
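A fixed-binning color histogram can be sketched as uniform scalar quantization of each channel; the bin count and layout here are illustrative choices:

```python
def fixed_rgb_histogram(pixels, bins_per_channel=2):
    """Fixed-binning color histogram: uniformly quantize each RGB channel
    (values 0-255) into bins_per_channel levels, giving
    bins_per_channel**3 bins; counts are normalized to sum to 1."""
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    for r, g, b in pixels:
        # scalar quantization of each channel
        ri = min(r * n // 256, n - 1)
        gi = min(g * n // 256, n - 1)
        bi = min(b * n // 256, n - 1)
        hist[(ri * n + gi) * n + bi] += 1
    total = len(pixels)
    return [c / total for c in hist]
```

With 2 bins per channel this is exactly the 8-bin RGB partition mentioned above.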
COMP9519 Multimedia Systems – Lecture 9 – Slide 23 – J. Chen
Fixed binning
- The same as scalar quantization of the color space
- Uses a fixed-size binning of the color space (the same for all images)
* H.R.Wu
COMP9519 Multimedia Systems – Lecture 9 – Slide 24 – J. Chen
Adaptive binning
- The same as VQ of the color space, adaptive to each image
- Partitions the color space into irregular-size bins to minimize the representation distortion incurred
- Optimization algorithm: k-means (a.k.a. the Lloyd algorithm)
- Applied to every image, hence time consuming
* H.R.Wu
COMP9519 Multimedia Systems – Lecture 9 – Slide 25 – J. Chen
Clustered binning
1. Gather the accumulated statistical distribution of pixel values over a training set of images
2. Perform k-means clustering on the accumulated distribution with a pre-determined number of bins (N), yielding N clusters with their respective centroids (similar to codebook generation in VQ)
3. Given a new image, calculate its color histogram based on the codebook generated in step 2: assign color c to cluster i if the distance between c and the centroid of cluster i is <= the distance between c and all other N-1 cluster centroids
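Step 3, assigning each color to its nearest codebook centroid, can be sketched as:

```python
def codebook_histogram(pixels, centroids):
    """Clustered binning: given a codebook of N cluster centroids (learned
    once, offline, by k-means over a training set), assign each pixel color
    to its nearest centroid and return the normalized bin counts."""
    hist = [0.0] * len(centroids)
    for c in pixels:
        # nearest-centroid assignment (squared Euclidean distance)
        best = min(range(len(centroids)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(c, centroids[i])))
        hist[best] += 1
    return [h / len(pixels) for h in hist]
```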
COMP9519 Multimedia Systems – Lecture 9 – Slide 26 – J. Chen
Comparison of different binning methods
Computational complexity
- Fixed binning: very low
- Clustered: medium (k-means once for all images, plus, for each image, the color-to-cluster assignment step, which is one part of a k-means iteration)
- Adaptive: high (k-means for each image individually)
Representation distortion
- Fixed binning > clustered > adaptive
COMP9519 Multimedia Systems – Lecture 9 – Slide 27 – J. Chen
Similarity metrics
Given two histograms (two feature vectors) I and J, how do we quantify the similarity between these two?
Distance is the inverse of similarity, defined as D(I, J) = f(I, J), where f is a distance function
COMP9519 Multimedia Systems – Lecture 9 – Slide 28 – J. Chen
Minkowski-form distance metric

    D(I, J) = ( sum_i | x_i(I) - x_i(J) |^p )^(1/p)

where p = 1, 2 or ∞, and the corresponding D(I, J) is called the L1, L2 (also called Euclidean) and L∞ distance respectively.

[Figure: two example histograms I and J, bin weights plotted over the bins]

Example: L2(I, J) = 2^(1/2)
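A direct implementation of the Minkowski-form distance:

```python
def minkowski_distance(hist_i, hist_j, p):
    """Minkowski-form distance between two histograms; p = 1, 2, or
    float('inf') gives the L1, L2 (Euclidean) and L-infinity distances."""
    diffs = [abs(a - b) for a, b in zip(hist_i, hist_j)]
    if p == float('inf'):
        # the limit p -> infinity is the maximum per-bin difference
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)
```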
COMP9519 Multimedia Systems – Lecture 9 – Slide 29 – J. Chen
Kullback-Leibler divergence and Jeffrey divergence

Kullback-Leibler divergence:

    D_KL(I, J) = sum_i x_i(I) log( x_i(I) / x_i(J) )

- Information-theory interpretation: measures how inefficient it is to code one histogram using the other as the codebook
- Non-symmetric, and sensitive to histogram binning

Jeffrey divergence:

    D_JD(I, J) = sum_i ( x_i(I) log( x_i(I) / x_i(J) ) + x_i(J) log( x_i(J) / x_i(I) ) )

- Empirically derived
- Symmetric
- Robust with respect to noise and the size of histogram bins
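Both divergences in a short sketch; skipping bins with zero mass is a common practical convention, assumed here rather than taken from the slide:

```python
import math

def kl_divergence(hist_i, hist_j):
    """Kullback-Leibler divergence D_KL(I, J); non-symmetric.
    Bins where x_i(I) == 0 contribute nothing."""
    return sum(a * math.log(a / b)
               for a, b in zip(hist_i, hist_j) if a > 0)

def jeffrey_divergence(hist_i, hist_j):
    """Symmetrized form: D_KL(I, J) + D_KL(J, I)."""
    return kl_divergence(hist_i, hist_j) + kl_divergence(hist_j, hist_i)
```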
COMP9519 Multimedia Systems – Lecture 9 – Slide 30 – J. Chen
χ² statistics

    D_χ²(I, J) = sum_i ( x_i(I) - x_i(J) )² / ( ( x_i(I) + x_i(J) ) / 2 )

- Statistical interpretation: measures how unlikely it is that one distribution was drawn from the population represented by the other
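A direct implementation of the χ² statistic; bin pairs with zero total mass are skipped to avoid division by zero (an implementation choice):

```python
def chi_square_distance(hist_i, hist_j):
    """Chi-square statistic between two histograms; empty bin pairs are
    skipped so the denominator (x_i(I) + x_i(J)) / 2 is never zero."""
    return sum((a - b) ** 2 / ((a + b) / 2.0)
               for a, b in zip(hist_i, hist_j) if a + b > 0)
```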
COMP9519 Multimedia Systems – Lecture 9 – Slide 31 – J. Chen
Bin-to-bin and cross-bin metrics
- Minkowski-form, K-L divergence, Jeffrey divergence and χ² statistics are all bin-to-bin metrics: they do not consider similarity across bins
- Cross-bin distance metrics consider the ground distance between bins

* Y. Rubner, C. Tomasi and L. J. Guibas, "The earth mover's distance as a metric for image retrieval," Int. J. Comput. Vision, vol. 40, no. 2, pp. 99-121, 2000.
COMP9519 Multimedia Systems – Lecture 9 – Slide 32 – J. Chen
Quadratic-form distance

    D_QF(I, J) = (F_I - F_J)^T A (F_I - F_J)

where A = [a_ij] is a similarity matrix, a_ij denotes the similarity (derived from the ground distance) between bins i and j, and F_I and F_J are vectors listing all bins of I and J. Some ground-distance functions:

    a_ij = 1 - d_ij / d_max

where d_ij is the L2 distance between bins i and j and d_max is the maximum d_ij, or

    a_ij = exp( -σ (d_ij / d_max)² )

where σ is a positive constant, which gives a faster roll-off of a_ij as a function of d_ij.
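A sketch of the quadratic-form distance using the exponential ground-distance function above. Taking the square root of the quadratic form is a choice made here so the result behaves like a distance:

```python
import math

def quadratic_form_distance(hist_i, hist_j, bin_centers, sigma=1.0):
    """Quadratic-form distance with a_ij = exp(-sigma * (d_ij / d_max)^2),
    where d_ij is the L2 distance between bin centers."""
    n = len(hist_i)
    d = [[math.dist(bin_centers[i], bin_centers[j]) for j in range(n)]
         for i in range(n)]
    d_max = max(max(row) for row in d) or 1.0   # avoid divide-by-zero
    a = [[math.exp(-sigma * (d[i][j] / d_max) ** 2) for j in range(n)]
         for i in range(n)]
    f = [hist_i[k] - hist_j[k] for k in range(n)]
    q = sum(f[i] * a[i][j] * f[j] for i in range(n) for j in range(n))
    return math.sqrt(max(q, 0.0))               # clamp small negative round-off
```

Because A couples bins, moving mass to a nearby bin costs less than moving it to a distant one, which is exactly the cross-bin behavior bin-to-bin metrics lack.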
COMP9519 Multimedia Systems – Lecture 9 – Slide 33 – J. Chen
Earth Mover's Distance (EMD)
- Given two histograms, move masses (earth) from one histogram to the other while minimising the total cost, i.e., weight × ground distance
- The two histograms are not required to have the same binning partition
- Empirically developed first; a statistical interpretation (the Mallows distance) was found later
COMP9519 Multimedia Systems – Lecture 9 – Slide 34 – J. Chen
EMD: mathematical representation
Given two histograms P = {(x_1, p_1), …, (x_m, p_m)} and Q = {(y_1, q_1), …, (y_n, q_n)}, where x_i (y_j) is the centre of cluster i (j) and p_i (q_j) is the number of points in the cluster, the EMD is derived by first solving for an optimal flow F = {f_ij} that minimizes

    sum_i sum_j f_ij d_ij

subject to

    f_ij >= 0,   sum_j f_ij <= p_i,   sum_i f_ij <= q_j,
    sum_i sum_j f_ij = min( sum_i p_i, sum_j q_j )

where d_ij is the ground distance between x_i and y_j. The EMD is then defined as

    EMD(P, Q) = ( sum_i sum_j f_ij d_ij ) / ( sum_i sum_j f_ij )
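The general EMD needs a linear-programming solver, but for 1-D histograms of equal total mass with ground distance |i - j| the optimal flow has a closed form: the EMD equals the L1 distance between the cumulative histograms (the Mallows-distance view). A sketch under those assumptions:

```python
def emd_1d(hist_p, hist_q):
    """EMD between two 1-D histograms of equal total mass, with ground
    distance |i - j| between bins i and j: the earth moved past each bin
    boundary equals the difference of the cumulative sums, so the total
    cost is the L1 distance between the CDFs."""
    assert abs(sum(hist_p) - sum(hist_q)) < 1e-9, "equal-mass case only"
    cost = cp = cq = 0.0
    for p, q in zip(hist_p, hist_q):
        cp += p
        cq += q
        cost += abs(cp - cq)
    return cost
```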
COMP9519 Multimedia Systems – Lecture 9 – Slide 35 – J. Chen
Comparison of histogram similarity metrics

                                  Lp       χ²       KL       JD       QF      EMD
    Symmetrical                   yes      yes      no       yes      yes     yes
    Computational complexity      medium   medium   medium   medium   high    high
    Ground distance               no       no       no       no       yes     yes
    Adaptive binning support      no       no       no       no       yes     yes
    Partial matches               no       no       no       no       no      yes

    Accuracy in image retrieval: depends on the application; χ² usually gives reasonably good results

* Y. Rubner et al., "Empirical evaluation of dissimilarity measures for color and texture," Comput. Vis. Image Underst., vol. 84, no. 1, pp. 25-43, 2001.
COMP9519 Multimedia Systems – Lecture 9 – Slide 36 – J. Chen
Outline
- Introduction
- Color features
  - Color and color spaces
  - Histograms and similarity metrics
  - Color descriptors
- Texture features
- Shape features
COMP9519 Multimedia Systems – Lecture 9 – Slide 37 – J. Chen
Color descriptors in MPEG-7
COMP9519 Multimedia Systems – Lecture 9 – Slide 38 – J. Chen
Dominant color descriptor (DCD)
- In the category of adaptive histograms
- RGB is the default color space
- Colors in an image are represented by N dominant color clusters:

    F = { (c_i, p_i, v_i), s },   i = 1, 2, …, N

where c_i is the color of a cluster, p_i is the fraction of pixels in the cluster relative to all pixels in the image, the optional parameter v_i is the variation of the color values of the pixels in the cluster, and s represents the overall spatial coherency of the dominant colors in the image.
COMP9519 Multimedia Systems – Lecture 9 – Slide 39 – J. Chen
Examples of high and low spatial coherency of color
Low High
* MPEG-7
COMP9519 Multimedia Systems – Lecture 9 – Slide 40 – J. Chen
Extraction of dominant colors
- Uses the Generalized Lloyd Algorithm (a.k.a. k-means), minimizing the distortion

    D_i = sum_{x(k) ∈ C_i} h(k) | x(k) - c_i |²,   i = 1, …, N

  where c_i is the centroid of cluster C_i, x(k) is the color at pixel k, and h(k) is the perceptual weight for pixel k, in the form of an exponential function, to account for the fact that the HVS is more sensitive to changes in smooth regions than in textured regions (see Y. Deng, S. Kenney, M. S. Moore and B. S. Manjunath, "Peer group filtering and perceptual color image quantization," ISCAS'99, Orlando, FL, vol. 4, pp. 21-24, June 1999)
- Update rule during the optimization:

    c_i = sum_{x(k) ∈ C_i} h(k) x(k) / sum_{x(k) ∈ C_i} h(k)

- Difference from normal GLA/k-means: the perceptual weighting h(k)
COMP9519 Multimedia Systems – Lecture 9 – Slide 41 – J. Chen
Similarity measurement for DCD
- The DCD is essentially an adaptive histogram, so Lp, χ², KL, JD etc. are not suitable
- The quadratic-form distance D_QF(I, J) = (F_I - F_J)^T A (F_I - F_J) is adopted in MPEG-7. Given two DCDs F_1 and F_2:

    D²(F_1, F_2) = sum_{i=1..N1} p_{1i}² + sum_{j=1..N2} p_{2j}² - sum_{i=1..N1} sum_{j=1..N2} 2 a_{1i,2j} p_{1i} p_{2j}

  where p is the percentage and a_{1i,2j} is the similarity coefficient (derived from the ground distance) between the two colors
- EMD may also be applied here
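A sketch of the DCD distance above. The similarity coefficient a_{1i,2j} = 1 - d/d_max (zero beyond a threshold d_max) is one common choice; both that form and the threshold value used here are assumptions:

```python
def dcd_distance_sq(dcd1, dcd2, d_max=20.0):
    """Squared quadratic-form distance between two dominant color
    descriptors, each a list of (color, percentage) pairs.  The similarity
    coefficient a = 1 - d / d_max for d <= d_max, else 0, is an assumed
    (though common) choice, as is the threshold d_max."""
    def a(c1, c2):
        d = sum((x - y) ** 2 for x, y in zip(c1, c2)) ** 0.5
        return 1.0 - d / d_max if d <= d_max else 0.0

    s = sum(p * p for _, p in dcd1) + sum(q * q for _, q in dcd2)
    s -= sum(2.0 * a(c1, c2) * p1 * p2
             for c1, p1 in dcd1 for c2, p2 in dcd2)
    return s
```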
COMP9519 Multimedia Systems – Lecture 9 – Slide 42 – J. Chen
Scalable color descriptor
- A color histogram in the HSV color space
- Encoded by a Haar wavelet transform
  - Sum coefficients: [1 1]
  - Difference coefficients: [1 -1]
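The Haar encoding can be sketched as repeated application of the [1 1] and [1 -1] filters to adjacent bin pairs; keeping only the top levels gives the coarse (scalable) description:

```python
def haar_stage(hist):
    """One Haar decomposition stage over a histogram of even length:
    sums[k] = h[2k] + h[2k+1]   (filter [1 1])
    diffs[k] = h[2k] - h[2k+1]  (filter [1 -1])."""
    assert len(hist) % 2 == 0
    sums = [hist[i] + hist[i + 1] for i in range(0, len(hist), 2)]
    diffs = [hist[i] - hist[i + 1] for i in range(0, len(hist), 2)]
    return sums, diffs

def scalable_color_transform(hist):
    """Recursively apply the sum stage (histogram length a power of two),
    keeping the difference coefficients from every level; a coarse
    description uses only the top levels."""
    coeffs = []
    while len(hist) > 1:
        hist, diffs = haar_stage(hist)
        coeffs.append(diffs)
    return hist[0], coeffs   # total mass + per-level difference coefficients
```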
COMP9519 Multimedia Systems – Lecture 9 – Slide 43 – J. Chen
Scalable color descriptor diagram
COMP9519 Multimedia Systems – Lecture 9 – Slide 44 – J. Chen
Outline
- Introduction
- Color features
  - Color and color spaces
  - Histograms and similarity metrics
  - Color descriptors
- Texture features
- Shape features
COMP9519 Multimedia Systems – Lecture 9 – Slide 45 – J. Chen
Texture
What is texture?
- Has structure or a repetitious pattern, e.g., a checkered pattern
- Has a statistical pattern, e.g., grass, sand, rocks
COMP9519 Multimedia Systems – Lecture 9 – Slide 46 – J. Chen
Brodatz textures
COMP9519 Multimedia Systems – Lecture 9 – Slide 47 – J. Chen
Why texture?
- Applications to satellite images and medical images
- Describes the content of real-world images, e.g., clouds, fabrics, surfaces, wood, stone
Challenging issues
- Rotation and scale invariance (3D)
- Segmentation/extraction of texture regions from images
- Texture in noise
COMP9519 Multimedia Systems – Lecture 9 – Slide 48 – J. Chen
Some approaches to texture features
Fourier-domain energy distribution, with features computed from the power spectrum |F(u, v)|²:
- Angular features (directionality): the energy in a wedge of directions,

    f_(θ1, θ2) = sum of |F(u, v)|² over { (u, v) : θ1 <= atan(v/u) < θ2 }

- Radial features (coarseness): the energy in a ring of frequencies,

    f_(r1, r2) = sum of |F(u, v)|² over { (u, v) : r1² <= u² + v² < r2² }
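Both feature types in a small sketch using a naive 2-D DFT (a practical implementation would use an FFT; the ring/wedge formulation of these features is the standard textbook one and an assumption here):

```python
import cmath, math

def power_spectrum(img):
    """Naive 2-D DFT power spectrum |F(u, v)|^2 of a small grayscale image."""
    m, n = len(img), len(img[0])
    ps = [[0.0] * n for _ in range(m)]
    for u in range(m):
        for v in range(n):
            f = sum(img[x][y] * cmath.exp(-2j * math.pi * (u * x / m + v * y / n))
                    for x in range(m) for y in range(n))
            ps[u][v] = abs(f) ** 2
    return ps

def radial_energy(ps, r1, r2):
    """Energy in the ring r1 <= sqrt(u^2 + v^2) < r2 (coarseness feature)."""
    return sum(ps[u][v] for u in range(len(ps)) for v in range(len(ps[0]))
               if r1 <= math.hypot(u, v) < r2)

def angular_energy(ps, t1, t2):
    """Energy in the wedge t1 <= atan2(v, u) < t2 (directionality feature)."""
    return sum(ps[u][v] for u in range(len(ps)) for v in range(len(ps[0]))
               if t1 <= math.atan2(v, u) < t2)
```

For a constant image all the energy sits at the DC term (0, 0), so the innermost ring captures everything.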
COMP9519 Multimedia Systems – Lecture 9 – Slide 49 – J. Chen
MPEG-7 texture descriptors
- Homogeneous texture descriptor
- Texture browsing descriptor
- Edge histogram (non-homogeneous texture descriptor)
COMP9519 Multimedia Systems – Lecture 9 – Slide 50 – J. Chen
Homogeneous Texture Descriptor (HTD)
- Partition the frequency domain into 30 channels (modeled by a 2D Gabor function)
- Compute the energy and energy deviation for each channel
- Compute the mean and standard deviation of the frequency coefficients

    F = { f_DC, f_SD, e_1, …, e_30, d_1, …, d_30 }
COMP9519 Multimedia Systems – Lecture 9 – Slide 51 – J. Chen
Channels used in computing the HTD
- The frequency-plane partition is uniform along the angular direction (30°) and non-uniform along the radial direction (on an octave scale)
- Can be implemented with a 2D Fourier transform

[Figure: the frequency plane (ω) divided into channels C_i, channel numbers i = 1 … 30: six 30° angular sectors crossed with five octave-spaced radial bands]
COMP9519 Multimedia Systems – Lecture 9 – Slide 52 – J. Chen
Gabor function
On top of the feature channels, the following 2D Gabor (modulated Gaussian) function is applied to each individual channel:

    G_{P_{s,r}}(ω, θ) = exp( -(ω - ω_s)² / (2 σ_{ω_s}²) ) · exp( -(θ - θ_r)² / (2 σ_{θ_r}²) )

This is equivalent to weighting the Fourier-transform coefficients of the image with a Gaussian centered at the frequency channel as defined above. Each channel filters a specific type of texture.
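The Gabor weighting itself is just the product of two Gaussians, one radial and one angular:

```python
import math

def gabor_weight(omega, theta, omega_s, theta_r, sigma_omega, sigma_theta):
    """2-D Gabor weighting in the frequency domain: a radial Gaussian
    centered at omega_s times an angular Gaussian centered at theta_r."""
    return (math.exp(-((omega - omega_s) ** 2) / (2 * sigma_omega ** 2)) *
            math.exp(-((theta - theta_r) ** 2) / (2 * sigma_theta ** 2)))
```

The weight peaks at 1 at the channel centre (ω_s, θ_r) and rolls off smoothly, so neighbouring channels overlap rather than cutting off sharply.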
COMP9519 Multimedia Systems – Lecture 9 – Slide 53 – J. Chen
Demo: http://nayana.ece.ucsb.edu/M7TextureDemo/Demo/client/M7TextureDemo.html
COMP9519 Multimedia Systems – Lecture 9 – Slide 54 – J. Chen
Edge histogram (I): sub-images
Images are divided into 16 non-overlapping sub-images.
* Manjunath paper
COMP9519 Multimedia Systems – Lecture 9 – Slide 55 – J. Chen
Edge histogram (II): edge-detection filters
Edges in the sub-images are categorized into five types: vertical, horizontal, 45-degree diagonal, 135-degree diagonal and non-directional edges.
Filters for edge detection (applied to 2×2 blocks):
a) vertical edge  b) horizontal edge  c) 45-degree edge  d) 135-degree edge  e) non-directional edge
* Manjunath paper & MPEG-7
COMP9519 Multimedia Systems – Lecture 9 – Slide 56 – J. Chen
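The five filters applied to a 2×2 block can be sketched as below. The coefficient values are the commonly cited MPEG-7 ones, quoted here as an assumption, and the threshold is illustrative:

```python
import math

# 2x2 edge-detection filter coefficients, in the order
# (top-left, top-right, bottom-left, bottom-right); these are the
# commonly cited MPEG-7 coefficients, quoted here as an assumption.
EDGE_FILTERS = {
    "vertical":        (1, -1, 1, -1),
    "horizontal":      (1, 1, -1, -1),
    "diag_45":         (math.sqrt(2), 0, 0, -math.sqrt(2)),
    "diag_135":        (0, math.sqrt(2), -math.sqrt(2), 0),
    "non_directional": (2, -2, -2, 2),
}

def classify_block(block, threshold=10.0):
    """Classify a 2x2 block (a, b, c, d) by the filter with the largest
    absolute response; blocks below the threshold count as 'no edge'."""
    responses = {name: abs(sum(f * p for f, p in zip(filt, block)))
                 for name, filt in EDGE_FILTERS.items()}
    name = max(responses, key=responses.get)
    return name if responses[name] >= threshold else "no_edge"
```

Counting the winning type per block over a sub-image yields that sub-image's 5-bin local edge histogram.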
Edge histogram (III)
- For each sub-image, a local edge histogram is constructed to represent the distribution of the five types of edges in that sub-image; in total, 16 × 5 = 80 local edge-histogram bins
- A global edge histogram and 65 semi-global edge histograms are computed from the 80 local histogram bins: for the global edge histogram, the five types of edge distributions for all sub-images are accumulated; for the semi-global edge histograms, subsets of sub-images are grouped
- The L1 norm of the difference of the local, semi-global and global histograms between two frames is adopted as the distance function
- The global-histogram difference is multiplied by 5, since the number of bins of the global histogram is much smaller than that of the local and semi-global histograms
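The distance function described above as a sketch (the 65 semi-global histograms contribute 65 × 5 = 325 bins):

```python
def edge_histogram_distance(local1, local2, semi1, semi2, glob1, glob2):
    """L1 distance over local, semi-global and global edge-histogram bins,
    with the global term weighted by 5 because the global histogram has
    far fewer bins (5) than the 80 local and 325 semi-global bins."""
    l1 = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
    return l1(local1, local2) + l1(semi1, semi2) + 5 * l1(glob1, glob2)
```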
COMP9519 Multimedia Systems – Lecture 9 – Slide 57 – J. Chen
Outline
- Introduction
- Color features
  - Color and color spaces
  - Histograms and similarity metrics
  - Color descriptors
- Texture features
- Shape features
COMP9519 Multimedia Systems – Lecture 9 – Slide 58 – J. Chen
Examples of contour- and region-based shape similarity
- Horizontal bar: shapes judged similar by the region-based descriptor
- Vertical bar: shapes judged similar by the contour-based descriptor
COMP9519 Multimedia Systems – Lecture 9 – Slide 59 – J. Chen
MPEG-7 shape features
2-dimensional (2D):
- Region Shape descriptor: the distribution of all pixels within a region
- Contour Shape descriptor: shape properties of the contour of an object
3-dimensional (3D):
- Shape3D descriptor: intrinsic shape characterization of 3D mesh models
- MultipleView descriptor: combined with a 2D shape descriptor for 3D shape description
COMP9519 Multimedia Systems – Lecture 9 – Slide 60 – J. Chen
Region shape descriptor
- Expresses the pixel distribution within a 2-D object region
- Based on both boundary and internal pixels
- Uses a complex 2D Angular Radial Transformation (ART): the real parts of the 2-D basis functions, whose origins are at the centers of each image
Advantages:
- Gives a compact and efficient way of describing the properties of multiple disjoint regions simultaneously
- The descriptor is robust to segmentation noise
COMP9519 Multimedia Systems – Lecture 9 – Slide 61 – J. Chen
Contour shape descriptor
- Defines a closed contour of a 2D object or region in an image or video sequence
- Uses the Curvature Scale Space (CSS) representation
[Figure: examples of shapes where a contour-based descriptor is applicable]
Advantages:
- Very compact representation (below 14 bytes in size on average)
- Can find semantically similar shapes (Fig. c)
- Robust to significant non-rigid deformations (Fig. d)
- Robust to distortions in the contour due to perspective transformation (Fig. e)
COMP9519 Multimedia Systems – Lecture 9 – Slide 62 – J. Chen
Curvature Scale-Space (CSS)
- Finds the curvature zero-crossing points of the shape's contour (key points); zero-crossings of the contour curvature function separate the concave and convex parts of the contour
- Reduces the number of key points iteratively by applying Gaussian smoothing
- The positions of the key points (horizontal coordinates) are expressed relative to the length of the contour curve
- The vertical coordinates (y_css) correspond to the amount of filtering applied
* MPEG-7
COMP9519 Multimedia Systems – Lecture 9 – Slide 63 – J. Chen
Concave and convex functionsFunction f is concave if the line segment joining any two points on the graph of f is never above the graph; f is convex if the line segment joining any two points on the graph is never below the graph
Convex Concave
COMP9519 Multimedia Systems – Lecture 9 – Slide 64 – J. Chen
Application – trademark retrieval
COMP9519 Multimedia Systems – Lecture 9 – Slide 65 – J. Chen
Review
- Introduction
- Color features
  - Color and color spaces
  - Histograms and similarity metrics
  - Color descriptors
- Texture features
- Shape features
- Motion features (next week)
COMP9519 Multimedia Systems – Lecture 9 – Slide 66 – J. Chen
Key References
- B. S. Manjunath, Philippe Salembier and Thomas Sikora, Introduction to MPEG-7: Multimedia Content Description Interface, John Wiley & Sons, New York, NY, 2002 (book)
- MPEG-7 visual standard
- T. Sikora, "The MPEG-7 Visual Standard for Content Description: an Overview," IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 696-702, June 2001
- B. S. Manjunath, J.-R. Ohm, V. V. Vasudevan and A. Yamada, "MPEG-7 Color and Texture Descriptors," IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 703-715, June 2001
- M. Bober, "MPEG-7 Visual Shape Descriptors," IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 716-719, June 2001