Lecture 9: Multimedia Content Description (2)
Dr Jing Chen, NICTA & CSE UNSW
COMP9519 Multimedia Systems, S2 2006
COMP9519 Multimedia Systems – Lecture 9 – Slide 2 – J. Chen
Last week's lecture: why describe multimedia content?
- Explosion in the sources of digital media content; large collections of media items
- The problem: how to search and discover multimedia content? How to index long video and audio sequences? How to browse content more efficiently?
- Application cases
- Content description standard: MPEG-7 (definition; goal: interoperability)
COMP9519 Multimedia Systems – Lecture 9 – Slide 3 – J. Chen
Acknowledgement
Thanks to Dr. Jack Yu for providing the initial version of the lecture slides.
COMP9519 Multimedia Systems – Lecture 9 – Slide 4 – J. Chen
Outline
- Introduction
- Color features
  - Color and color spaces
  - Histograms and similarity metrics
  - Color descriptors
- Texture features
- Shape features
- Motion features (next lecture)
COMP9519 Multimedia Systems – Lecture 9 – Slide 5 – J. Chen
Example
COMP9519 Multimedia Systems – Lecture 9 – Slide 6 – J. Chen
Visual features
Why visual features?
- Manual labeling is subjective and time-consuming
- It is difficult to describe content completely by text
What visual features?
- Extractable from images/video
- Learned from the human visual system
Mathematical representation
- M pixels in R³ (color channels) → N-dimensional feature vectors
- E.g., color histogram: 640×480 pixels in R³ → a 40-d vector (bins); N << M, and the dimension N is usually small
COMP9519 Multimedia Systems – Lecture 9 – Slide 7 – J. Chen
Good visual features
- Compactness of the representation
- Discriminative power
- Invariance to occlusion, shift, rotation, lighting changes, etc.
- Low complexity
COMP9519 Multimedia Systems – Lecture 9 – Slide 8 – J. Chen
Popular visual features
- Color: color histogram, color moments, dominant color
- Texture (structural and statistical): edge histogram, Tamura features
- Shape: boundaries of objects
- Motion: camera motion and object motion
COMP9519 Multimedia Systems – Lecture 9 – Slide 9 – J. Chen
Outline
- Introduction
- Color features
  - Color and color spaces
  - Histograms and similarity metrics
  - Color descriptors
- Texture features
- Shape features
- Motion features
- Content search examples with features
COMP9519 Multimedia Systems – Lecture 9 – Slide 10 – J. Chen
Color
Ref: Gonzalez and Woods, Digital Image Processing
COMP9519 Multimedia Systems – Lecture 9 – Slide 11 – J. Chen
Primary colorsOwing to the structure of the human visual system, all colors are seen as variable combinations of the three primary colors: Red, Green and Blue
COMP9519 Multimedia Systems – Lecture 9 – Slide 12 – J. Chen
RGB color space
A color space is a 3-D coordinate system, together with a subspace within that system, where each color is represented by a single point via its coordinates. RGB is the most commonly used color space.
COMP9519 Multimedia Systems – Lecture 9 – Slide 13 – J. Chen
HSV colour space
- Hue (H) represents the dominant spectral component: color in its pure form, as in green, red, or yellow
- Saturation (S) refers to the relative purity, or the amount of white light mixed with a hue
- Value (V) corresponds to the brightness of the color
Why the HSV colour space?
- Perceptually uniform: geometric distance is consistent with perceptual distance
- More natural to humans: more meaningful, easier to work with
HSV color space visualized as a cylindrical object
COMP9519 Multimedia Systems – Lecture 9 – Slide 14 – J. Chen
RGB to HSV conversion
    V = max(r, g, b)
    S = ( max(r, g, b) - min(r, g, b) ) / max(r, g, b)
    H depends on which of r, g, b is the maximum
// Theoretically, hue 0 (pure red) is identical to hue 6 in these transforms.
// Pure red always maps to 6 in this implementation, so UNDEFINED can be
// defined as 0 in situations where only unsigned numbers are desired.
#define RETURN_HSV(h, s, v) { HSV.H = h; HSV.S = s; HSV.V = v; return HSV; }

typedef struct { float R, G, B; } RGBType;
typedef struct { float H, S, V; } HSVType;

// R, G and B are each on [0, 1]. S and V are returned on [0, 1] and H is
// returned on [0, 6]; exception: H is returned UNDEFINED if S == 0.
HSVType RGB_to_HSV(RGBType RGB)
{
    float R = RGB.R, G = RGB.G, B = RGB.B, v, x, f;
    int i;
    HSVType HSV;
    x = min(R, G, B);                 /* smallest component */
    v = max(R, G, B);                 /* largest component = V */
    if (v == x)
        RETURN_HSV(UNDEFINED, 0, v);  /* achromatic: S = 0, H undefined */
    f = (R == x) ? G - B : ((G == x) ? B - R : R - G);
    i = (R == x) ? 3 : ((G == x) ? 5 : 1);
    RETURN_HSV(i - f / (v - x), (v - x) / v, v);
}
COMP9519 Multimedia Systems – Lecture 9 – Slide 15 – J. Chen
Distance between two color points in the HSV color space
Given two colors (h1, s1, v1) and (h2, s2, v2), where h, s and v are in the range [0, 1], the distance between these two colors is:
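A minimal sketch of one way such a distance is commonly computed: embed both colors in the HSV cone, so that hue differences matter less for dark, unsaturated colors. The cone embedding is an assumed form, not necessarily the slide's exact formula.

```python
import math

def hsv_cone_distance(c1, c2):
    """Distance between two HSV colors (h, s, v), each component in [0, 1].

    Embeds each color in a cone: hue is an angle, saturation (scaled by
    value) the radius, and value the height. Assumed form, one common choice.
    """
    def to_xyz(h, s, v):
        angle = 2.0 * math.pi * h
        return (s * v * math.cos(angle), s * v * math.sin(angle), v)

    x1, y1, z1 = to_xyz(*c1)
    x2, y2, z2 = to_xyz(*c2)
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)
```

With this embedding, pure red (0, 1, 1) and pure cyan (0.5, 1, 1) sit at opposite ends of the top disc, distance 2.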
COMP9519 Multimedia Systems – Lecture 9 – Slide 16 – J. Chen
HMMD color space: five parameters
- Hue: the same as defined for HSV
- Max and Min: the maximum and minimum among the R, G and B values (blackness and whiteness)
- Diff = Max - Min (colorfulness)
- Sum = Max + Min (brightness)
Three parameters, Hue, Max and Min (or Hue, Diff and Sum), are enough to describe the color space.
Adopted in MPEG-7; used in the color structure descriptor (CSD).
Advantage: close to perceptually uniform.
[Figure: HMMD double-cone diagram, with White at the Max apex, Black at the Min apex, Sum along the vertical axis, Diff along the radial axis, and Hue as the angle]
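The five HMMD parameters can be computed directly from RGB. A minimal sketch, with hue taken from the standard HSV conversion and no rescaling of Diff and Sum:

```python
import colorsys

def rgb_to_hmmd(r, g, b):
    """Convert RGB (each in [0, 1]) to HMMD (hue, max, min, diff, sum).

    Hue is the same as in HSV (here in [0, 1)); Max/Min are the extreme
    RGB components; Diff = Max - Min (colorfulness); Sum = Max + Min
    (brightness). No rescaling of Diff/Sum is applied in this sketch.
    """
    mx, mn = max(r, g, b), min(r, g, b)
    h, _, _ = colorsys.rgb_to_hsv(r, g, b)   # hue as defined for HSV
    return h, mx, mn, mx - mn, mx + mn
```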
COMP9519 Multimedia Systems – Lecture 9 – Slide 17 – J. Chen
Distance between two color points in HMMD
Suppose h is from 0 to 2π, min is from 0 to 1, max is from 0 to 1, diff is from 0 to sqrt(2)/2 and sum is from 0 to sqrt(2), where

    diff = (max - min) / sqrt(2),   sum = (max + min) / sqrt(2)

The distance between c1 and c2 (the range of the distance value is 0 to 1) is

    dist(c1, c2) = sqrt( d_s² + d_d² ) / 2

where

    d_s = |sum_1 - sum_2|

and

    d_d = sqrt( diff_1² + diff_2² - 2 · diff_1 · diff_2 · cos(h_1 - h_2) )
COMP9519 Multimedia Systems – Lecture 9 – Slide 18 – J. Chen
YCbCr color space
ITU-R BT.601 defines Y as the brightness (luma), Cb as blue minus luma (B - Y), and Cr as red minus luma (R - Y). Y has a nominal range of 16-235 (black to white); Cb and Cr have a nominal range of 16-240, with 128 corresponding to zero. YCbCr is defined to have been derived from gamma pre-corrected component RGB signals:

    Y  = (77/256)R + (150/256)G + (29/256)B
    Cb = -(44/256)R - (87/256)G + (131/256)B + 128
    Cr = (131/256)R - (110/256)G - (21/256)B + 128

    R = Y + 1.371(Cr - 128)
    G = Y - 0.698(Cr - 128) - 0.336(Cb - 128)
    B = Y + 1.732(Cb - 128)
COMP9519 Multimedia Systems – Lecture 9 – Slide 19 – J. Chen
YCbCr color space (continued)
If the 24-bit RGB data have a range of 0-255 (black to white), as commonly found in PCs, the following equations should be used to maintain the correct black and white levels:

    Y  = 0.257R + 0.504G + 0.098B + 16
    Cb = -0.148R - 0.291G + 0.439B + 128
    Cr = 0.439R - 0.368G - 0.071B + 128

    R = 1.164(Y - 16) + 1.596(Cr - 128)
    G = 1.164(Y - 16) - 0.813(Cr - 128) - 0.392(Cb - 128)
    B = 1.164(Y - 16) + 2.017(Cb - 128)

YCbCr represents color as brightness plus two color-difference signals, while RGB represents color as red, green and blue.
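The PC-range (0-255) equations can be checked with a small round-trip sketch:

```python
def rgb_to_ycbcr(r, g, b):
    """0-255 RGB to YCbCr, using the PC-range (0-255 black-to-white) equations."""
    y  =  0.257 * r + 0.504 * g + 0.098 * b + 16
    cb = -0.148 * r - 0.291 * g + 0.439 * b + 128
    cr =  0.439 * r - 0.368 * g - 0.071 * b + 128
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    """Inverse transform, using the matching PC-range equations."""
    r = 1.164 * (y - 16) + 1.596 * (cr - 128)
    g = 1.164 * (y - 16) - 0.813 * (cr - 128) - 0.392 * (cb - 128)
    b = 1.164 * (y - 16) + 2.017 * (cb - 128)
    return r, g, b
```

Because the published coefficients are rounded, a forward-then-inverse conversion recovers RGB only to within a small error, not exactly.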
COMP9519 Multimedia Systems – Lecture 9 – Slide 20 – J. Chen
Outline
- Introduction
- Color features
  - Color and color spaces
  - Histograms and similarity metrics
  - Color descriptors
- Texture features
- Shape features
COMP9519 Multimedia Systems – Lecture 9 – Slide 21 – J. Chen
Histograms
- A one-dimensional data distribution = a set of (bin, frequency) pairs
- Each bin has its associated attribute value
- The set of bins partitions the feature space (here, the gray-scale values)
* Nuno Vasconcelos and Andrew Lippman
COMP9519 Multimedia Systems – Lecture 9 – Slide 22 – J. Chen
Color histograms
- Partition the feature space into several bins
- Record the number of pixels falling into each bin
- Example: partition of the RGB color space into 8 bins
Three types of histograms, depending on how we partition the color space:
- Fixed binning
- Clustered binning
- Adaptive binning
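A fixed-binning color histogram can be sketched as uniform scalar quantization of each channel; the bin count and layout here are illustrative choices:

```python
def fixed_rgb_histogram(pixels, bins_per_channel=2):
    """Fixed-binning color histogram: uniformly quantize each RGB channel
    (values 0-255) into bins_per_channel levels, giving
    bins_per_channel**3 bins; counts are normalized to sum to 1."""
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    for r, g, b in pixels:
        # scalar quantization of each channel
        ri = min(r * n // 256, n - 1)
        gi = min(g * n // 256, n - 1)
        bi = min(b * n // 256, n - 1)
        hist[(ri * n + gi) * n + bi] += 1
    total = len(pixels)
    return [c / total for c in hist]
```

With 2 bins per channel this is exactly the 8-bin RGB partition mentioned above.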
COMP9519 Multimedia Systems – Lecture 9 – Slide 23 – J. Chen
Fixed binning
- The same as scalar quantization of the color space
- Uses a fixed-size binning of the color space (the same for all images)
* H.R.Wu
COMP9519 Multimedia Systems – Lecture 9 – Slide 24 – J. Chen
Adaptive binning
- The same as VQ of the color space, adaptive to each image
- Partitions the color space into irregular-size bins to minimize the representation distortion incurred
- Optimization algorithm: k-means (a.k.a. the Lloyd algorithm)
- Applied to every image, hence time consuming
* H.R.Wu
COMP9519 Multimedia Systems – Lecture 9 – Slide 25 – J. Chen
Clustered binning
1. Gather the accumulated statistical distribution of pixel values over a training set of images
2. Perform k-means clustering on the accumulated distribution with a pre-determined number of bins (N), yielding N clusters with their respective centroids (similar to codebook generation in VQ)
3. Given a new image, calculate its color histogram based on the codebook generated in step 2: assign color c to cluster i if the distance between c and the centroid of cluster i is <= the distance between c and all other N-1 cluster centroids
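Step 3, assigning each color to its nearest codebook centroid, can be sketched as:

```python
def codebook_histogram(pixels, centroids):
    """Clustered binning: given a codebook of N cluster centroids (learned
    once, offline, by k-means over a training set), assign each pixel color
    to its nearest centroid and return the normalized bin counts."""
    hist = [0.0] * len(centroids)
    for c in pixels:
        # nearest-centroid assignment (squared Euclidean distance)
        best = min(range(len(centroids)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(c, centroids[i])))
        hist[best] += 1
    return [h / len(pixels) for h in hist]
```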
COMP9519 Multimedia Systems – Lecture 9 – Slide 26 – J. Chen
Comparison of different binning methods
Computational complexity
- Fixed binning: very low
- Clustered: medium (k-means once for all images, plus, for each image, the color-to-cluster assignment step, which is one part of a k-means iteration)
- Adaptive: high (k-means for each image individually)
Representation distortion
- Fixed binning > clustered > adaptive
COMP9519 Multimedia Systems – Lecture 9 – Slide 27 – J. Chen
Similarity metrics
Given two histograms (two feature vectors) I and J, how do we quantify the similarity between these two?
Distance is the inverse of similarity, defined as D(I, J) = f(I, J), where f is a distance function
COMP9519 Multimedia Systems – Lecture 9 – Slide 28 – J. Chen
Minkowski-form distance metric

    D(I, J) = ( sum_i | x_i(I) - x_i(J) |^p )^(1/p)

where p = 1, 2 or ∞, and the corresponding D(I, J) is called the L1, L2 (also called Euclidean) and L∞ distance respectively.

[Figure: two example histograms I and J, bin weights plotted over the bins]

Example: L2(I, J) = 2^(1/2)
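A direct implementation of the Minkowski-form distance:

```python
def minkowski_distance(hist_i, hist_j, p):
    """Minkowski-form distance between two histograms; p = 1, 2, or
    float('inf') gives the L1, L2 (Euclidean) and L-infinity distances."""
    diffs = [abs(a - b) for a, b in zip(hist_i, hist_j)]
    if p == float('inf'):
        # the limit p -> infinity is the maximum per-bin difference
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)
```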
COMP9519 Multimedia Systems – Lecture 9 – Slide 29 – J. Chen
Kullback-Leibler divergence and Jeffrey divergence

Kullback-Leibler divergence:

    D_KL(I, J) = sum_i x_i(I) log( x_i(I) / x_i(J) )

- Information-theory interpretation: measures how inefficient it is to code one histogram using the other as the codebook
- Non-symmetric, and sensitive to histogram binning

Jeffrey divergence:

    D_JD(I, J) = sum_i ( x_i(I) log( x_i(I) / x_i(J) ) + x_i(J) log( x_i(J) / x_i(I) ) )

- Empirically derived
- Symmetric
- Robust with respect to noise and the size of histogram bins
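Both divergences in a short sketch; skipping bins with zero mass is a common practical convention, assumed here rather than taken from the slide:

```python
import math

def kl_divergence(hist_i, hist_j):
    """Kullback-Leibler divergence D_KL(I, J); non-symmetric.
    Bins where x_i(I) == 0 contribute nothing."""
    return sum(a * math.log(a / b)
               for a, b in zip(hist_i, hist_j) if a > 0)

def jeffrey_divergence(hist_i, hist_j):
    """Symmetrized form: D_KL(I, J) + D_KL(J, I)."""
    return kl_divergence(hist_i, hist_j) + kl_divergence(hist_j, hist_i)
```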
COMP9519 Multimedia Systems – Lecture 9 – Slide 30 – J. Chen
χ² statistics

    D_χ²(I, J) = sum_i ( x_i(I) - x_i(J) )² / ( ( x_i(I) + x_i(J) ) / 2 )

- Statistical interpretation: measures how unlikely it is that one distribution was drawn from the population represented by the other
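A direct implementation of the χ² statistic; bin pairs with zero total mass are skipped to avoid division by zero (an implementation choice):

```python
def chi_square_distance(hist_i, hist_j):
    """Chi-square statistic between two histograms; empty bin pairs are
    skipped so the denominator (x_i(I) + x_i(J)) / 2 is never zero."""
    return sum((a - b) ** 2 / ((a + b) / 2.0)
               for a, b in zip(hist_i, hist_j) if a + b > 0)
```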
COMP9519 Multimedia Systems – Lecture 9 – Slide 31 – J. Chen
Bin-to-bin and cross-bin metrics
- Minkowski-form, K-L divergence, Jeffrey divergence and χ² statistics are all bin-to-bin metrics: they do not consider similarity across bins
- Cross-bin distance metrics consider the ground distance between bins

* Y. Rubner, C. Tomasi and L. J. Guibas, "The earth mover's distance as a metric for image retrieval," Int. J. Comput. Vision, vol. 40, no. 2, pp. 99-121, 2000.
COMP9519 Multimedia Systems – Lecture 9 – Slide 32 – J. Chen
Quadratic-form distance

    D_QF(I, J) = (F_I - F_J)^T A (F_I - F_J)

where A = [a_ij] is a similarity matrix, a_ij denotes the similarity (derived from the ground distance) between bins i and j, and F_I and F_J are vectors listing all bins of I and J. Some ground-distance functions:

    a_ij = 1 - d_ij / d_max

where d_ij is the L2 distance between bins i and j and d_max is the maximum d_ij, or

    a_ij = exp( -σ (d_ij / d_max)² )

where σ is a positive constant, which gives a faster roll-off of a_ij as a function of d_ij.
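A sketch of the quadratic-form distance using the exponential ground-distance function above. Taking the square root of the quadratic form is a choice made here so the result behaves like a distance:

```python
import math

def quadratic_form_distance(hist_i, hist_j, bin_centers, sigma=1.0):
    """Quadratic-form distance with a_ij = exp(-sigma * (d_ij / d_max)^2),
    where d_ij is the L2 distance between bin centers."""
    n = len(hist_i)
    d = [[math.dist(bin_centers[i], bin_centers[j]) for j in range(n)]
         for i in range(n)]
    d_max = max(max(row) for row in d) or 1.0   # avoid divide-by-zero
    a = [[math.exp(-sigma * (d[i][j] / d_max) ** 2) for j in range(n)]
         for i in range(n)]
    f = [hist_i[k] - hist_j[k] for k in range(n)]
    q = sum(f[i] * a[i][j] * f[j] for i in range(n) for j in range(n))
    return math.sqrt(max(q, 0.0))               # clamp small negative round-off
```

Because A couples bins, moving mass to a nearby bin costs less than moving it to a distant one, which is exactly the cross-bin behavior bin-to-bin metrics lack.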
COMP9519 Multimedia Systems – Lecture 9 – Slide 33 – J. Chen
Earth Mover's Distance (EMD)
- Given two histograms, move masses (earth) from one histogram to the other while minimising the total cost, i.e., weight × ground distance
- The two histograms are not required to have the same binning partition
- Empirically developed first; a statistical interpretation (the Mallows distance) was found later
COMP9519 Multimedia Systems – Lecture 9 – Slide 34 – J. Chen
EMD: mathematical representation
Given two histograms P = {(x_1, p_1), …, (x_m, p_m)} and Q = {(y_1, q_1), …, (y_n, q_n)}, where x_i (y_j) is the centre of cluster i (j) and p_i (q_j) is the number of points in the cluster, the EMD is derived by first solving for an optimal flow F = {f_ij} that minimizes

    sum_i sum_j f_ij d_ij

subject to

    f_ij >= 0,   sum_j f_ij <= p_i,   sum_i f_ij <= q_j,
    sum_i sum_j f_ij = min( sum_i p_i, sum_j q_j )

where d_ij is the ground distance between x_i and y_j. The EMD is then defined as

    EMD(P, Q) = ( sum_i sum_j f_ij d_ij ) / ( sum_i sum_j f_ij )
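The general EMD needs a linear-programming solver, but for 1-D histograms of equal total mass with ground distance |i - j| the optimal flow has a closed form: the EMD equals the L1 distance between the cumulative histograms (the Mallows-distance view). A sketch under those assumptions:

```python
def emd_1d(hist_p, hist_q):
    """EMD between two 1-D histograms of equal total mass, with ground
    distance |i - j| between bins i and j: the earth moved past each bin
    boundary equals the difference of the cumulative sums, so the total
    cost is the L1 distance between the CDFs."""
    assert abs(sum(hist_p) - sum(hist_q)) < 1e-9, "equal-mass case only"
    cost = cp = cq = 0.0
    for p, q in zip(hist_p, hist_q):
        cp += p
        cq += q
        cost += abs(cp - cq)
    return cost
```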
COMP9519 Multimedia Systems – Lecture 9 – Slide 35 – J. Chen
Comparison of histogram similarity metrics

                                  Lp       χ²       KL       JD       QF      EMD
    Symmetrical                   yes      yes      no       yes      yes     yes
    Computational complexity      medium   medium   medium   medium   high    high
    Ground distance               no       no       no       no       yes     yes
    Adaptive binning support      no       no       no       no       yes     yes
    Partial matches               no       no       no       no       no      yes

    Accuracy in image retrieval: depends on the application; χ² usually gives reasonably good results

* Y. Rubner et al., "Empirical evaluation of dissimilarity measures for color and texture," Comput. Vis. Image Underst., vol. 84, no. 1, pp. 25-43, 2001.
COMP9519 Multimedia Systems – Lecture 9 – Slide 36 – J. Chen
Outline
- Introduction
- Color features
  - Color and color spaces
  - Histograms and similarity metrics
  - Color descriptors
- Texture features
- Shape features
COMP9519 Multimedia Systems – Lecture 9 – Slide 37 – J. Chen
Color descriptors in MPEG-7
COMP9519 Multimedia Systems – Lecture 9 – Slide 38 – J. Chen
Dominant color descriptor (DCD)
- In the category of adaptive histograms
- RGB is the default color space
- Colors in an image are represented by N dominant color clusters:

    F = { (c_i, p_i, v_i), s },   i = 1, 2, …, N

where c_i is the color of a cluster, p_i is the fraction of pixels in the cluster relative to all pixels in the image, the optional parameter v_i is the variation of the color values of the pixels in the cluster, and s represents the overall spatial coherency of the dominant colors in the image.
COMP9519 Multimedia Systems – Lecture 9 – Slide 39 – J. Chen
Examples of high and low spatial coherency of color
Low High
* MPEG-7
COMP9519 Multimedia Systems – Lecture 9 – Slide 40 – J. Chen
Extraction of dominant colors
- Uses the Generalized Lloyd Algorithm (a.k.a. k-means), minimizing the distortion

    D_i = sum_{x(k) ∈ C_i} h(k) | x(k) - c_i |²,   i = 1, …, N

  where c_i is the centroid of cluster C_i, x(k) is the color at pixel k, and h(k) is the perceptual weight for pixel k, in the form of an exponential function, to account for the fact that the HVS is more sensitive to changes in smooth regions than in textured regions (see Y. Deng, S. Kenney, M. S. Moore and B. S. Manjunath, "Peer group filtering and perceptual color image quantization," ISCAS'99, Orlando, FL, vol. 4, pp. 21-24, June 1999)
- Update rule during the optimization:

    c_i = sum_{x(k) ∈ C_i} h(k) x(k) / sum_{x(k) ∈ C_i} h(k)

- Difference from normal GLA/k-means: the perceptual weighting h(k)
COMP9519 Multimedia Systems – Lecture 9 – Slide 41 – J. Chen
Similarity measurement for DCD
- The DCD is essentially an adaptive histogram, so Lp, χ², KL, JD etc. are not suitable
- The quadratic-form distance D_QF(I, J) = (F_I - F_J)^T A (F_I - F_J) is adopted in MPEG-7. Given two DCDs F_1 and F_2:

    D²(F_1, F_2) = sum_{i=1..N1} p_{1i}² + sum_{j=1..N2} p_{2j}² - sum_{i=1..N1} sum_{j=1..N2} 2 a_{1i,2j} p_{1i} p_{2j}

  where p is the percentage and a_{1i,2j} is the similarity coefficient (derived from the ground distance) between the two colors
- EMD may also be applied here
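A sketch of the DCD distance above. The similarity coefficient a_{1i,2j} = 1 - d/d_max (zero beyond a threshold d_max) is one common choice; both that form and the threshold value used here are assumptions:

```python
def dcd_distance_sq(dcd1, dcd2, d_max=20.0):
    """Squared quadratic-form distance between two dominant color
    descriptors, each a list of (color, percentage) pairs.  The similarity
    coefficient a = 1 - d / d_max for d <= d_max, else 0, is an assumed
    (though common) choice, as is the threshold d_max."""
    def a(c1, c2):
        d = sum((x - y) ** 2 for x, y in zip(c1, c2)) ** 0.5
        return 1.0 - d / d_max if d <= d_max else 0.0

    s = sum(p * p for _, p in dcd1) + sum(q * q for _, q in dcd2)
    s -= sum(2.0 * a(c1, c2) * p1 * p2
             for c1, p1 in dcd1 for c2, p2 in dcd2)
    return s
```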
COMP9519 Multimedia Systems – Lecture 9 – Slide 42 – J. Chen
Scalable color descriptor
- A color histogram in the HSV color space
- Encoded by a Haar wavelet transform
  - Sum coefficients: [1 1]
  - Difference coefficients: [1 -1]
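The Haar encoding can be sketched as repeated application of the [1 1] and [1 -1] filters to adjacent bin pairs; keeping only the top levels gives the coarse (scalable) description:

```python
def haar_stage(hist):
    """One Haar decomposition stage over a histogram of even length:
    sums[k] = h[2k] + h[2k+1]   (filter [1 1])
    diffs[k] = h[2k] - h[2k+1]  (filter [1 -1])."""
    assert len(hist) % 2 == 0
    sums = [hist[i] + hist[i + 1] for i in range(0, len(hist), 2)]
    diffs = [hist[i] - hist[i + 1] for i in range(0, len(hist), 2)]
    return sums, diffs

def scalable_color_transform(hist):
    """Recursively apply the sum stage (histogram length a power of two),
    keeping the difference coefficients from every level; a coarse
    description uses only the top levels."""
    coeffs = []
    while len(hist) > 1:
        hist, diffs = haar_stage(hist)
        coeffs.append(diffs)
    return hist[0], coeffs   # total mass + per-level difference coefficients
```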
COMP9519 Multimedia Systems – Lecture 9 – Slide 43 – J. Chen
Scalable color descriptor diagram
COMP9519 Multimedia Systems – Lecture 9 – Slide 44 – J. Chen
Outline
- Introduction
- Color features
  - Color and color spaces
  - Histograms and similarity metrics
  - Color descriptors
- Texture features
- Shape features
COMP9519 Multimedia Systems – Lecture 9 – Slide 45 – J. Chen
Texture
What is texture?
- Has structure or a repetitious pattern, e.g., a checkered pattern
- Has a statistical pattern, e.g., grass, sand, rocks
COMP9519 Multimedia Systems – Lecture 9 – Slide 46 – J. Chen
Brodatz textures
COMP9519 Multimedia Systems – Lecture 9 – Slide 47 – J. Chen
Why texture?
- Applications to satellite images and medical images
- Describes the content of real-world images, e.g., clouds, fabrics, surfaces, wood, stone
Challenging issues
- Rotation and scale invariance (3D)
- Segmentation/extraction of texture regions from images
- Texture in noise
COMP9519 Multimedia Systems – Lecture 9 – Slide 48 – J. Chen
Some approaches to texture features
Fourier-domain energy distribution, with features computed from the power spectrum |F(u, v)|²:
- Angular features (directionality): the energy in a wedge of directions,

    f_(θ1, θ2) = sum of |F(u, v)|² over { (u, v) : θ1 <= atan(v/u) < θ2 }

- Radial features (coarseness): the energy in a ring of frequencies,

    f_(r1, r2) = sum of |F(u, v)|² over { (u, v) : r1² <= u² + v² < r2² }
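Both feature types in a small sketch using a naive 2-D DFT (a practical implementation would use an FFT; the ring/wedge formulation of these features is the standard textbook one and an assumption here):

```python
import cmath, math

def power_spectrum(img):
    """Naive 2-D DFT power spectrum |F(u, v)|^2 of a small grayscale image."""
    m, n = len(img), len(img[0])
    ps = [[0.0] * n for _ in range(m)]
    for u in range(m):
        for v in range(n):
            f = sum(img[x][y] * cmath.exp(-2j * math.pi * (u * x / m + v * y / n))
                    for x in range(m) for y in range(n))
            ps[u][v] = abs(f) ** 2
    return ps

def radial_energy(ps, r1, r2):
    """Energy in the ring r1 <= sqrt(u^2 + v^2) < r2 (coarseness feature)."""
    return sum(ps[u][v] for u in range(len(ps)) for v in range(len(ps[0]))
               if r1 <= math.hypot(u, v) < r2)

def angular_energy(ps, t1, t2):
    """Energy in the wedge t1 <= atan2(v, u) < t2 (directionality feature)."""
    return sum(ps[u][v] for u in range(len(ps)) for v in range(len(ps[0]))
               if t1 <= math.atan2(v, u) < t2)
```

For a constant image all the energy sits at the DC term (0, 0), so the innermost ring captures everything.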
COMP9519 Multimedia Systems – Lecture 9 – Slide 49 – J. Chen
MPEG-7 texture descriptors
- Homogeneous texture descriptor
- Texture browsing descriptor
- Edge histogram (non-homogeneous texture descriptor)
COMP9519 Multimedia Systems – Lecture 9 – Slide 50 – J. Chen
Homogeneous Texture Descriptor (HTD)
- Partition the frequency domain into 30 channels (modeled by a 2D Gabor function)
- Compute the energy and energy deviation for each channel
- Compute the mean and standard deviation of the frequency coefficients

    F = { f_DC, f_SD, e_1, …, e_30, d_1, …, d_30 }
COMP9519 Multimedia Systems – Lecture 9 – Slide 51 – J. Chen
Channels used in computing the HTD
- The frequency-plane partition is uniform along the angular direction (30°) and non-uniform along the radial direction (on an octave scale)
- Can be implemented with a 2D Fourier transform

[Figure: the frequency plane (ω) divided into channels C_i, channel numbers i = 1 … 30: six 30° angular sectors crossed with five octave-spaced radial bands]
COMP9519 Multimedia Systems – Lecture 9 – Slide 52 – J. Chen
Gabor function
On top of the feature channels, the following 2D Gabor (modulated Gaussian) function is applied to each individual channel:

    G_{P_{s,r}}(ω, θ) = exp( -(ω - ω_s)² / (2 σ_{ω_s}²) ) · exp( -(θ - θ_r)² / (2 σ_{θ_r}²) )

This is equivalent to weighting the Fourier-transform coefficients of the image with a Gaussian centered at the frequency channel as defined above. Each channel filters a specific type of texture.
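The Gabor weighting itself is just the product of two Gaussians, one radial and one angular:

```python
import math

def gabor_weight(omega, theta, omega_s, theta_r, sigma_omega, sigma_theta):
    """2-D Gabor weighting in the frequency domain: a radial Gaussian
    centered at omega_s times an angular Gaussian centered at theta_r."""
    return (math.exp(-((omega - omega_s) ** 2) / (2 * sigma_omega ** 2)) *
            math.exp(-((theta - theta_r) ** 2) / (2 * sigma_theta ** 2)))
```

The weight peaks at 1 at the channel centre (ω_s, θ_r) and rolls off smoothly, so neighbouring channels overlap rather than cutting off sharply.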
COMP9519 Multimedia Systems – Lecture 9 – Slide 53 – J. Chen
Demo: http://nayana.ece.ucsb.edu/M7TextureDemo/Demo/client/M7TextureDemo.html
COMP9519 Multimedia Systems – Lecture 9 – Slide 54 – J. Chen
Edge histogram (I): sub-images
Images are divided into 16 non-overlapping sub-images.
* Manjunath paper
COMP9519 Multimedia Systems – Lecture 9 – Slide 55 – J. Chen
Edge histogram (II): edge-detection filters
Edges in the sub-images are categorized into five types: vertical, horizontal, 45-degree diagonal, 135-degree diagonal and non-directional edges.
Filters for edge detection (applied to 2×2 blocks):
a) vertical edge  b) horizontal edge  c) 45-degree edge  d) 135-degree edge  e) non-directional edge
* Manjunath paper & MPEG-7
COMP9519 Multimedia Systems – Lecture 9 – Slide 56 – J. Chen
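The five filters applied to a 2×2 block can be sketched as below. The coefficient values are the commonly cited MPEG-7 ones, quoted here as an assumption, and the threshold is illustrative:

```python
import math

# 2x2 edge-detection filter coefficients, in the order
# (top-left, top-right, bottom-left, bottom-right); these are the
# commonly cited MPEG-7 coefficients, quoted here as an assumption.
EDGE_FILTERS = {
    "vertical":        (1, -1, 1, -1),
    "horizontal":      (1, 1, -1, -1),
    "diag_45":         (math.sqrt(2), 0, 0, -math.sqrt(2)),
    "diag_135":        (0, math.sqrt(2), -math.sqrt(2), 0),
    "non_directional": (2, -2, -2, 2),
}

def classify_block(block, threshold=10.0):
    """Classify a 2x2 block (a, b, c, d) by the filter with the largest
    absolute response; blocks below the threshold count as 'no edge'."""
    responses = {name: abs(sum(f * p for f, p in zip(filt, block)))
                 for name, filt in EDGE_FILTERS.items()}
    name = max(responses, key=responses.get)
    return name if responses[name] >= threshold else "no_edge"
```

Counting the winning type per block over a sub-image yields that sub-image's 5-bin local edge histogram.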
Edge histogram (III)
- For each sub-image, a local edge histogram is constructed to represent the distribution of the five types of edges in that sub-image; in total, 16 × 5 = 80 local edge-histogram bins
- A global edge histogram and 65 semi-global edge histograms are computed from the 80 local histogram bins: for the global edge histogram, the five types of edge distributions for all sub-images are accumulated; for the semi-global edge histograms, subsets of sub-images are grouped
- The L1 norm of the difference of the local, semi-global and global histograms between two frames is adopted as the distance function
- The global-histogram difference is multiplied by 5, since the number of bins of the global histogram is much smaller than that of the local and semi-global histograms
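The distance function described above as a sketch (the 65 semi-global histograms contribute 65 × 5 = 325 bins):

```python
def edge_histogram_distance(local1, local2, semi1, semi2, glob1, glob2):
    """L1 distance over local, semi-global and global edge-histogram bins,
    with the global term weighted by 5 because the global histogram has
    far fewer bins (5) than the 80 local and 325 semi-global bins."""
    l1 = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
    return l1(local1, local2) + l1(semi1, semi2) + 5 * l1(glob1, glob2)
```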
COMP9519 Multimedia Systems – Lecture 9 – Slide 57 – J. Chen
Outline
- Introduction
- Color features
  - Color and color spaces
  - Histograms and similarity metrics
  - Color descriptors
- Texture features
- Shape features
COMP9519 Multimedia Systems – Lecture 9 – Slide 58 – J. Chen
Examples of contour- and region-based shape similarity
- Horizontal bar: shapes judged similar by the region-based descriptor
- Vertical bar: shapes judged similar by the contour-based descriptor
COMP9519 Multimedia Systems – Lecture 9 – Slide 59 – J. Chen
MPEG-7 shape features
2-dimensional (2D):
- Region Shape descriptor: the distribution of all pixels within a region
- Contour Shape descriptor: shape properties of the contour of an object
3-dimensional (3D):
- Shape3D descriptor: intrinsic shape characterization of 3D mesh models
- MultipleView descriptor: combined with a 2D shape descriptor for 3D shape description
COMP9519 Multimedia Systems – Lecture 9 – Slide 60 – J. Chen
Region shape descriptor
- Expresses the pixel distribution within a 2-D object region
- Based on both boundary and internal pixels
- Uses a complex 2D Angular Radial Transformation (ART): the real parts of the 2-D basis functions, whose origins are at the centers of each image
Advantages:
- Gives a compact and efficient way of describing the properties of multiple disjoint regions simultaneously
- The descriptor is robust to segmentation noise
COMP9519 Multimedia Systems – Lecture 9 – Slide 61 – J. Chen
Contour shape descriptor
- Defines a closed contour of a 2D object or region in an image or video sequence
- Uses the Curvature Scale Space (CSS) representation
[Figure: examples of shapes where a contour-based descriptor is applicable]
Advantages:
- Very compact representation (below 14 bytes in size on average)
- Can find semantically similar shapes (Fig. c)
- Robust to significant non-rigid deformations (Fig. d)
- Robust to distortions in the contour due to perspective transformation (Fig. e)
COMP9519 Multimedia Systems – Lecture 9 – Slide 62 – J. Chen
Curvature Scale-Space (CSS)
- Finds the curvature zero-crossing points of the shape's contour (key points); zero-crossings of the contour curvature function separate the concave and convex parts of the contour
- Reduces the number of key points iteratively by applying Gaussian smoothing
- The positions of the key points (horizontal coordinates) are expressed relative to the length of the contour curve
- The vertical coordinates (y_css) correspond to the amount of filtering applied
* MPEG-7
COMP9519 Multimedia Systems – Lecture 9 – Slide 63 – J. Chen
Concave and convex functionsFunction f is concave if the line segment joining any two points on the graph of f is never above the graph; f is convex if the line segment joining any two points on the graph is never below the graph
Convex Concave
COMP9519 Multimedia Systems – Lecture 9 – Slide 64 – J. Chen
Application – trademark retrieval
COMP9519 Multimedia Systems – Lecture 9 – Slide 65 – J. Chen
Review
- Introduction
- Color features
  - Color and color spaces
  - Histograms and similarity metrics
  - Color descriptors
- Texture features
- Shape features
- Motion features (next week)
COMP9519 Multimedia Systems – Lecture 9 – Slide 66 – J. Chen
Key References
- B. S. Manjunath, Philippe Salembier and Thomas Sikora, Introduction to MPEG-7: Multimedia Content Description Interface, John Wiley & Sons, New York, NY, 2002 (book)
- MPEG-7 visual standard
- T. Sikora, "The MPEG-7 Visual Standard for Content Description: an Overview," IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 696-702, June 2001
- B. S. Manjunath, J.-R. Ohm, V. V. Vasudevan and A. Yamada, "MPEG-7 Color and Texture Descriptors," IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 703-715, June 2001
- M. Bober, "MPEG-7 Visual Shape Descriptors," IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 716-719, June 2001