TRANSCRIPT
Semantic Retrieval and Automatic Annotation
Linear Transformations, Correlation and Semantic Spaces
Jonathon Hare & Paul Lewis
School of Electronics and Computer Science
University of Southampton
Introduction and Motivation
• Introduce a new, simple linear-transform based annotation/retrieval technique
• Compare against a number of similar existing techniques for automatic annotation & semantic retrieval that:
• Represent images by a fixed length histogram (of visual-term occurrences)
• Optionally use SVD for noise reduction
• Are deterministic (no randomness)
• Are (relatively) computationally efficient
• Reflect on real-world performance
Singular Value Decomposition
• SVD can be used to filter noise by producing a rank-k estimate of the original data matrix
• The rank-k estimate is optimal in the least-squares sense
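The rank-k estimate can be sketched in a few lines of NumPy (the function name and matrix shapes here are illustrative, not from the paper); by the Eckart–Young theorem this reconstruction is the best rank-k approximation in the least-squares (Frobenius norm) sense:

```python
import numpy as np

def rank_k_estimate(A, k):
    """Best rank-k approximation of A in the least-squares sense."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular values/vectors
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.random((6, 5))
A2 = rank_k_estimate(A, 2)   # noise-filtered, rank <= 2
```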
Nomenclature
• F is a visual-term occurrence matrix (columns represent images, rows visual-terms)
• W is a keyword occurrence matrix (columns represent images, rows keywords)
Technique: Linear Transform
Assume that visual-term occurrence vectors can be related to keyword occurrence vectors by a simple linear transformation, T.
FT=W
T can be estimated using the pseudo-inverse (calculated via the SVD, which also allows noise reduction) given a training set with known F and W; the unknown W* for unannotated images can then be calculated from F* and T.
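A minimal sketch of this technique in NumPy, assuming an image-per-row convention so that F @ T = W conforms (the function names, shapes, and the optional rank parameter are illustrative assumptions, not the paper's notation):

```python
import numpy as np

def estimate_transform(F, W, k=None):
    """Estimate T via the (optionally rank-truncated) pseudo-inverse of F.

    F: images x visual-terms, W: images x keywords.
    Truncating to rank k implements the SVD noise reduction.
    """
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    if k is not None:
        U, s, Vt = U[:, :k], s[:k], Vt[:k, :]
    F_pinv = Vt.T @ np.diag(1.0 / s) @ U.T   # pseudo-inverse of F
    return F_pinv @ W

rng = np.random.default_rng(0)
F_train = rng.random((40, 10))   # visual-term histograms (training)
W_train = rng.random((40, 5))    # keyword occurrences (training)
T = estimate_transform(F_train, W_train)

F_star = rng.random((3, 10))     # unannotated images
W_star = F_star @ T              # predicted keyword scores
```

In practice the truncation rank k would be chosen on a validation set, as described in the experimental protocol below.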
Technique: Semantic Spaces
• Based around the factorisation [F; W] = TD (the visual-term matrix F stacked on top of the keyword matrix W)
• Calculated using truncated SVD
• Rows of T represent coordinates of the features and words in a vector space
• Columns of D represent coordinates of images in the same space
• Similar objects have similar locations in the space, so it is possible to rank images on their distance to a given word
Hare, J. S., Lewis, P. H., Enser, P. G. B., and Sandom, C. J., “A Linear-Algebraic Technique with an Application in Semantic Image Retrieval,” in CIVR 2006, Sundaram, H., Naphade, M., Smith, J. R., and
Rui, Y., eds., LNCS 4071, 31–40, Springer (2006).
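The semantic-space factorisation can be sketched as follows. This is an illustrative assumption about the details: the split of the singular values between T and D shown here is one common choice, and the shapes are made up for the example:

```python
import numpy as np

def semantic_space(F, W, k):
    """Factorise the stacked matrix [F; W] with a truncated SVD.

    F: visual-terms x images, W: keywords x images (columns = images).
    Returns T (rows = feature/word coordinates) and
            D (columns = image coordinates) in the same k-dim space.
    """
    S = np.vstack([F, W])
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    T = U[:, :k] * s[:k]   # one conventional split of the singular values
    D = Vt[:k, :]
    return T, D

rng = np.random.default_rng(1)
F = rng.random((500, 100))   # 500 visual terms x 100 images
W = rng.random((120, 100))   # 120 keywords x 100 images
T, D = semantic_space(F, W, k=40)

# Rank images by cosine similarity to keyword 0 (its row sits below F's rows):
word_vec = T[F.shape[0] + 0]
sims = (word_vec @ D) / (np.linalg.norm(word_vec) * np.linalg.norm(D, axis=0))
ranking = np.argsort(-sims)   # most similar images first
```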
Technique: Correlation
• Pan et al. defined four techniques for building translation tables between visual terms and keywords [i.e. the elements of the table/matrix represent p(wᵢ, fⱼ)]
• The Corr method used WᵀF to build the table
• The Cos method used the cosine similarity of wᵢ and fⱼ
• The SVDCorr and SVDCos methods filtered the tables from the Corr and Cos methods reducing the rank using the SVD
Pan, J.-Y., Yang, H.-J., Duygulu, P., and Faloutsos, C., “Automatic image captioning,” IEEE International Conference on Multimedia and Expo 2004 (ICME ’04). Vol.3 (27-30 June 2004).
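A sketch of these tables under the nomenclature slide's columns-are-images convention, where the keyword/visual-term co-occurrence table is W Fᵀ (equivalent to WᵀF when images are rows, as in Pan et al.); the function names and the annotation-by-scoring usage are illustrative assumptions:

```python
import numpy as np

def corr_table(F, W):
    """Corr: keyword x visual-term co-occurrence table (columns = images)."""
    return W @ F.T

def cos_table(F, W):
    """Cos: cosine similarity between each keyword row and visual-term row."""
    C = W @ F.T
    wn = np.linalg.norm(W, axis=1, keepdims=True)   # per-keyword norms
    fn = np.linalg.norm(F, axis=1, keepdims=True)   # per-term norms
    return C / (wn @ fn.T)

def svd_filter(C, k):
    """SVDCorr / SVDCos: reduce the rank of a table using the SVD."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(2)
F = rng.random((500, 30))    # visual terms x images
W = rng.random((80, 30))     # keywords x images
table = svd_filter(cos_table(F, W), k=10)

f_new = rng.random(500)      # histogram of an unannotated image
scores = table @ f_new       # one score per keyword
```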
Technique Summary

Technique         Variables                                    Notes
Transform         feature-weighting, dimensionality reduction  Words independent
Corr, Cos         feature-weighting                            Words independent
SVDCorr, SVDCos   feature-weighting, dimensionality reduction  Words independent
Semantic Space    feature-weighting, dimensionality reduction  Inter-word dependencies
Image Features
• Two types of visual-term feature considered:
• Segmented-blob based (using shape, colour, texture descriptors) [500 terms]
• Quantised DCT-based [500 terms]
Experimental Protocol
• 5000 image Corel data-set
• 4000 training images
• 500 validation images (for optimising reduced rank)
• 500 test images
• Two weighting types: unweighted and IDF
• Evaluation performed as a hypothetical retrieval experiment
• Unannotated test images retrieved in response to using each word in turn as a query
• Mean-average precision used for comparison
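The evaluation measure can be sketched as follows (a standard definition of average precision over a ranked list; the helper name and example are illustrative, not from the paper):

```python
import numpy as np

def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: 1/0 relevance flags for images in ranked order.
    mAP is the mean of this value over all keyword queries.
    """
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    hits = np.cumsum(rel)                           # relevant seen so far
    precision_at_k = hits / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

# e.g. relevant images retrieved at ranks 1 and 3:
ap = average_precision([1, 0, 1, 0])   # (1/1 + 2/3) / 2 = 0.8333...
```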
Real-world performance
• ~20% mAP might sound low, but in practice many queries work quite well (reasonable initial precision, though it drops off quickly)
• Choice of image features is very important
• It would be difficult to learn the concept of “sun” from grey-level SIFT features!
• See the paper for some more reflection on real-world performance...
Conclusions
• We have described a set of auto-annotation/semantic retrieval algorithms
• Performance is less than the state-of-the-art, but this is partially explained by the use of different image features (see our MIR 2010 paper)
• However, the methods:
• Are computationally inexpensive (although the cost grows with the amount of training data)
• Are deterministic, and don’t rely on algorithms such as EM which might get stuck in local minima/maxima