TRANSCRIPT
Semantic Retrieval and Automatic Annotation
Linear Transformations, Correlation and Semantic Spaces
Jonathon Hare & Paul Lewis
School of Electronics and Computer Science
University of Southampton
Introduction and Motivation
• Introduce a new, simple linear-transform based annotation/retrieval technique
• Compare against a number of similar existing techniques for automatic annotation & semantic retrieval that:
• Represent images by a fixed length histogram (of visual-term occurrences)
• Optionally use SVD for noise reduction
• Are deterministic (no randomness)
• Are (relatively) computationally efficient
• Reflect on real-world performance
Singular Value Decomposition
• SVD can be used to filter noise by producing a rank-k estimate of the original data matrix
• The rank-k estimate is optimal in the least-squares sense
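The rank-k estimate can be sketched in a few lines of NumPy (the function name and matrix shapes here are illustrative, not from the paper); by the Eckart–Young theorem this reconstruction is the best rank-k approximation in the least-squares (Frobenius norm) sense:

```python
import numpy as np

def rank_k_estimate(A, k):
    """Best rank-k approximation of A in the least-squares sense."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular values/vectors
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.random((6, 5))
A2 = rank_k_estimate(A, 2)   # noise-filtered, rank <= 2
```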
Nomenclature
• F is a visual-term occurrence matrix (columns represent images, rows visual-terms)
• W is a keyword occurrence matrix (columns represent images, rows keywords)
Technique: Linear Transform
Assume that visual-term occurrence vectors can be related to keyword occurrence vectors by a simple linear transformation, T.
FT=W
T can be estimated using the pseudo-inverse (calculated via the SVD, which also allows noise reduction) given a training set with known F and W; the unknown W* for unannotated images can then be calculated from F* and T.
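A minimal sketch of this technique in NumPy, assuming an image-per-row convention so that F @ T = W conforms (the function names, shapes, and the optional rank parameter are illustrative assumptions, not the paper's notation):

```python
import numpy as np

def estimate_transform(F, W, k=None):
    """Estimate T via the (optionally rank-truncated) pseudo-inverse of F.

    F: images x visual-terms, W: images x keywords.
    Truncating to rank k implements the SVD noise reduction.
    """
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    if k is not None:
        U, s, Vt = U[:, :k], s[:k], Vt[:k, :]
    F_pinv = Vt.T @ np.diag(1.0 / s) @ U.T   # pseudo-inverse of F
    return F_pinv @ W

rng = np.random.default_rng(0)
F_train = rng.random((40, 10))   # visual-term histograms (training)
W_train = rng.random((40, 5))    # keyword occurrences (training)
T = estimate_transform(F_train, W_train)

F_star = rng.random((3, 10))     # unannotated images
W_star = F_star @ T              # predicted keyword scores
```

In practice the truncation rank k would be chosen on a validation set, as described in the experimental protocol below.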
Technique: Semantic Spaces
• Based around the factorisation [F; W] = TD (the visual-term matrix F stacked on top of the keyword matrix W)
• Calculated using truncated SVD
• Rows of T represent coordinates of the features and words in a vector space
• Columns of D represent coordinates of images in the same space
• Similar objects have similar locations in the space, so it is possible to rank images on their distance to a given word
Hare, J. S., Lewis, P. H., Enser, P. G. B., and Sandom, C. J., “A Linear-Algebraic Technique with an Application in Semantic Image Retrieval,” in CIVR 2006, Sundaram, H., Naphade, M., Smith, J. R., and
Rui, Y., eds., LNCS 4071, 31–40, Springer (2006).
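The semantic-space factorisation can be sketched as follows. This is an illustrative assumption about the details: the split of the singular values between T and D shown here is one common choice, and the shapes are made up for the example:

```python
import numpy as np

def semantic_space(F, W, k):
    """Factorise the stacked matrix [F; W] with a truncated SVD.

    F: visual-terms x images, W: keywords x images (columns = images).
    Returns T (rows = feature/word coordinates) and
            D (columns = image coordinates) in the same k-dim space.
    """
    S = np.vstack([F, W])
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    T = U[:, :k] * s[:k]   # one conventional split of the singular values
    D = Vt[:k, :]
    return T, D

rng = np.random.default_rng(1)
F = rng.random((500, 100))   # 500 visual terms x 100 images
W = rng.random((120, 100))   # 120 keywords x 100 images
T, D = semantic_space(F, W, k=40)

# Rank images by cosine similarity to keyword 0 (its row sits below F's rows):
word_vec = T[F.shape[0] + 0]
sims = (word_vec @ D) / (np.linalg.norm(word_vec) * np.linalg.norm(D, axis=0))
ranking = np.argsort(-sims)   # most similar images first
```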
Technique: Correlation
• Pan et al. defined four techniques for building translation tables between visual terms and keywords [i.e. the elements of the table/matrix represent p(wᵢ, fⱼ)]
• The Corr method used WᵀF to build the table
• The Cos method used the cosine similarity of wᵢ and fⱼ
• The SVDCorr and SVDCos methods filtered the tables from the Corr and Cos methods reducing the rank using the SVD
Pan, J.-Y., Yang, H.-J., Duygulu, P., and Faloutsos, C., “Automatic image captioning,” IEEE International Conference on Multimedia and Expo 2004 (ICME ’04). Vol.3 (27-30 June 2004).
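A sketch of these tables under the nomenclature slide's columns-are-images convention, where the keyword/visual-term co-occurrence table is W Fᵀ (equivalent to WᵀF when images are rows, as in Pan et al.); the function names and the annotation-by-scoring usage are illustrative assumptions:

```python
import numpy as np

def corr_table(F, W):
    """Corr: keyword x visual-term co-occurrence table (columns = images)."""
    return W @ F.T

def cos_table(F, W):
    """Cos: cosine similarity between each keyword row and visual-term row."""
    C = W @ F.T
    wn = np.linalg.norm(W, axis=1, keepdims=True)   # per-keyword norms
    fn = np.linalg.norm(F, axis=1, keepdims=True)   # per-term norms
    return C / (wn @ fn.T)

def svd_filter(C, k):
    """SVDCorr / SVDCos: reduce the rank of a table using the SVD."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(2)
F = rng.random((500, 30))    # visual terms x images
W = rng.random((80, 30))     # keywords x images
table = svd_filter(cos_table(F, W), k=10)

f_new = rng.random(500)      # histogram of an unannotated image
scores = table @ f_new       # one score per keyword
```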
Technique Summary

Technique         Variables                                    Notes
Transform         feature-weighting, dimensionality reduction  Words independent
Corr, Cos         feature-weighting                            Words independent
SVDCorr, SVDCos   feature-weighting, dimensionality reduction  Words independent
Semantic Space    feature-weighting, dimensionality reduction  Inter-word dependencies
Image Features
• Two types of visual-term feature considered:
• Segmented-blob based (using shape, colour, texture descriptors) [500 terms]
• Quantised DCT-based [500 terms]
Experimental Protocol
• 5000 image Corel data-set
• 4000 training images
• 500 validation images (for optimising reduced rank)
• 500 test images
• Two weighting types: unweighted and IDF
• Evaluation performed as a hypothetical retrieval experiment
• Unannotated test images retrieved in response to using each word in turn as a query
• Mean-average precision used for comparison
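The evaluation measure can be sketched as follows (a standard definition of average precision over a ranked list; the helper name and example are illustrative, not from the paper):

```python
import numpy as np

def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: 1/0 relevance flags for images in ranked order.
    mAP is the mean of this value over all keyword queries.
    """
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    hits = np.cumsum(rel)                           # relevant seen so far
    precision_at_k = hits / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

# e.g. relevant images retrieved at ranks 1 and 3:
ap = average_precision([1, 0, 1, 0])   # (1/1 + 2/3) / 2 = 0.8333...
```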
Real-world performance
• ~20% mAP might sound low, but in practice many queries work quite well (reasonable initial precision, though it drops off quickly)
• Choice of image features is very important
• It would be difficult to learn the concept of “sun” from grey-level SIFT features!
• See the paper for some more reflection on real-world performance...
Conclusions
• We have described a set of auto-annotation/semantic retrieval algorithms
• Performance is less than the state-of-the-art, but this is partially explained by the use of different image features (see our MIR 2010 paper)
• However, the methods:
• Are computationally inexpensive (although the cost grows with the amount of training data)
• Are deterministic, and don’t rely on algorithms such as EM which might get stuck in local minima/maxima