
Multimed Tools Appl
DOI 10.1007/s11042-015-2524-6

Image classification based on improved VLAD

Xianzhong Long · Hongtao Lu · Yong Peng · Xianzhong Wang · Shaokun Feng

Received: 25 August 2014 / Revised: 22 December 2014 / Accepted: 18 February 2015
© Springer Science+Business Media New York 2015

Abstract Recently, a coding scheme called vector of locally aggregated descriptors (VLAD) has achieved great success in large-scale image retrieval owing to its efficient, compact representation. VLAD uses only the nearest-neighbor visual word in the dictionary to aggregate each descriptor. It offers fast retrieval and high retrieval accuracy with a small dictionary. In this paper, we give three improved VLAD variants for image classification: first, similar to the bag of words (BoW) model, we count the number of descriptors belonging to each cluster center and add it to VLAD; second, to expand the impact of residuals, squared residuals are taken into account; third, instead of a single nearest-neighbor visual word, we look for the two nearest-neighbor visual words when aggregating each descriptor. Experimental results on the UIUC Sports Event, Corel 10 and 15 Scenes datasets show that the proposed methods outperform some state-of-the-art coding schemes in terms of classification accuracy and computation speed.

X. Long (✉)
School of Computer Science & Technology, School of Software, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China
e-mail: [email protected]

H. Lu · Y. Peng · X. Wang · S. Feng
Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China

H. Lu
e-mail: [email protected]

Y. Peng
e-mail: [email protected]

X. Wang
e-mail: [email protected]

S. Feng
e-mail: [email protected]


Keywords Image classification · Scale-invariant feature transform · Vector of locally aggregated descriptors · K-means clustering algorithm

1 Introduction

As one of the most important and challenging tasks in computer vision and pattern recognition, image classification has recently received much attention. Several benchmark datasets are used to evaluate image classification algorithms, for example UIUC Sports Event [23], Corel 10 [26], 15 Scenes [21], Caltech 101 [10] and Caltech 256 [14]. Many image classification models have recently been proposed, such as generative models [2, 22, 33], discriminative models [9, 18, 27, 39] and hybrid generative/discriminative models [3]. A generative model classifies images from a probabilistic viewpoint; it depends only on the data themselves and does not require training or learning parameters. In contrast, a discriminative model approaches classification from a non-probabilistic perspective and needs to train or learn the parameters that appear in the classifier. Here, we only consider image classification based on discriminative models.

Among discriminative models, the bag of words (BoW) technique [35] was the earliest to win wide popularity and has been applied to image retrieval [31], video event detection [37] and image classification [6, 13]. However, the BoW representation lacks descriptive capability because it is only a histogram of the number of image descriptors assigned to each visual word, and it ignores the spatial information of the image. To solve this problem, the Spatial Pyramid Matching (SPM) model was put forward in [21], which takes the spatial information of the image into account. In fact, SPM is an extension of the BoW model and has been shown to achieve better image classification accuracy than the latter [15, 36, 38].

Image classification based on the SPM model involves five steps: local descriptor extraction, dictionary learning, feature coding, spatial pooling and classifier selection. Commonly used local descriptors include Scale-Invariant Feature Transform (SIFT) [25], Histogram of Oriented Gradients (HoG) [7], Affine Scale-Invariant Feature Transform (ASIFT) [28] and Oriented FAST and Rotated BRIEF (ORB) [34]. After the descriptors of all images have been extracted, vector quantization [21] or sparse coding [38] is used to train a dictionary. In the feature coding phase, each image's descriptor matrix is mapped to a coefficient matrix by the chosen coding strategy. The principle of spatial pooling deserves a clear explanation because it dominates the whole SPM-based classification framework. During spatial pooling, an image is divided into increasingly finer subregions over $L$ layers, with $2^l \times 2^l$ subregions at layer $l$, $l = 0, 1, \cdots, L-1$. A typical partition uses three layers, i.e., $L = 3$. At layer 0, the image is taken as a whole; at layer 1, it is divided into four regions; at layer 2, each subregion of layer 1 is further divided into four, resulting in 16 smaller subregions. This process generates a spatial pyramid of three layers with a total of $1 + 4 + 16 = 21$ subregions. The spatial pyramid is then combined with the feature coding process, and different pooling functions are used, e.g., sum pooling [21] and max pooling [36, 38]. Finally, the feature vectors of the 21 subregions are concatenated into one long feature vector for the whole image; this is the spatial pyramid representation of the image. The dimensionality of the new representation for each image is $21P$, where $P$ is the dictionary size. Note that when only layer $l = 0$ is used, SPM reduces to the original BoW model. In the last step, a classifier such as a Support Vector Machine (SVM) [5] or Adaptive Boosting (AdaBoost) [11] is applied to classify the images.
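As a rough illustration of the spatial pooling step described above, the following Python sketch pools per-descriptor codes over a three-layer pyramid and concatenates the results into a $21P$-dimensional vector. The function name `spatial_pyramid_pool`, the normalized descriptor coordinates `xy`, and the zero vector used for empty subregions are assumptions made for this example, not details taken from [21] or [36, 38].

```python
import numpy as np

def spatial_pyramid_pool(codes, xy, num_layers=3, pool="max"):
    """Pool P-dimensional codes over a spatial pyramid (illustrative sketch).

    codes: (M, P) array, one coding vector per descriptor.
    xy:    (M, 2) array of descriptor positions scaled to [0, 1) (assumed).
    Returns a vector of length P * sum(4**l for l in range(num_layers)).
    """
    M, P = codes.shape
    pooled = []
    for l in range(num_layers):
        cells = 2 ** l  # 2^l x 2^l subregions at layer l
        # Integer cell index of every descriptor along x and y.
        idx = np.minimum((xy * cells).astype(int), cells - 1)
        flat = idx[:, 1] * cells + idx[:, 0]
        for c in range(cells * cells):
            members = codes[flat == c]
            if members.shape[0] == 0:
                pooled.append(np.zeros(P))  # empty subregion (assumed convention)
            elif pool == "max":
                pooled.append(members.max(axis=0))
            else:  # sum pooling
                pooled.append(members.sum(axis=0))
    return np.concatenate(pooled)
```

With three layers and $P = 4$, the output has $21 \times 4 = 84$ entries, and the first $P$ entries are the pooled codes of the whole image (layer 0).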

Over the past several years, a number of dictionary learning methods and feature coding strategies have been proposed for image classification. In [6], the K-means clustering algorithm, a vector quantization (VQ) technique, was used to generate the dictionary; during the feature coding phase, each local descriptor was given a binary code that specified the cluster center to which it belonged. This process is called BoW, and it produces a histogram representation over visual words. However, this approach is likely to result in a large reconstruction error because hard assignment limits the ability to represent descriptors. To address this problem, SPM based on sparse coding (ScSPM) was proposed in [38]; it replaces K-means clustering with an L1-norm-based sparse coding scheme and learns the dictionary from randomly sampled SIFT feature vectors. During the feature coding phase, ScSPM uses sparse coding to encode each local descriptor. However, ScSPM becomes very slow when the dictionary size is large. To accelerate the computation while maintaining high classification accuracy, locality-constrained linear coding (LLC) was put forward in [36], which gives an analytical solution for feature coding. Furthermore, several improved image classification schemes based on SPM have been suggested recently, such as spatial pyramid matching using Laplacian sparse coding [12], discriminative spatial pyramid [15], discriminative affine sparse codes [20] and nearest neighbor basis vectors spatial pyramid matching (NNBVSPM) [24]. Finding efficient feature coding strategies has become a pressing research direction.

In the field of pattern recognition, the Fisher vector (FV) technique has been used for image classification [4, 19, 29, 30]. FV is a strong framework that combines the advantages of generative and discriminative approaches. The key idea of FV is to represent a signal by a gradient vector derived from a generative probability model and then to feed this representation to a discriminative classifier. FV can therefore be seen as a hybrid generative/discriminative model. The vector of locally aggregated descriptors (VLAD) can be viewed as a non-probabilistic version of FV in which the gradient is taken only with respect to the mean and Gaussian mixture model (GMM) clustering is replaced by K-means. VLAD has been successfully applied to image retrieval [1, 8, 16, 17]. To take higher-order statistics into account, researchers have proposed two further coding methods, vectors of locally aggregated tensors (VLAT) [32] and super-vector (SV) [41]. The dimensionality of VLAT is $P(D + D^2)$, where $D$ is the dimension of each descriptor; such a high-dimensional representation can lead to very long computation times. Besides, SV is based on a probabilistic viewpoint and is still a generative model. Therefore, we do not consider the VLAT and SV feature coding algorithms. In this paper, we concentrate on image classification methods based on discriminative models; BoW, ScSPM, LLC and VLAD are selected for comparison with our improved VLAD methods.

In order to obtain stronger coding ability and improve the classification accuracy or speed, three improved VLAD versions for image classification are given in this paper. First, similar to the bag of words (BoW) model, we count the number of descriptors belonging to each cluster center and add it to VLAD. In this way, our improved VLAD method possesses the characteristics of BoW. Second, in order to expand the impact of residuals, squared residuals are added to the original VLAD, which makes the dimension of the new representation twice that of the original. Third, some descriptors have nearly the same distance to more than one visual word, so assigning them only to the nearest visual word, as in the original VLAD, is not appropriate. Instead of a single nearest-neighbor visual word, we look for the two nearest-neighbor visual words when aggregating each descriptor.

The remainder of the paper is organized as follows. Section 2 introduces the basic ideas of existing schemes. Our improved VLAD methods are presented in Section 3. In Section 4, the comparison results of image classification on three widely used datasets are reported. Finally, conclusions are drawn and some future research issues are discussed in Section 5.

2 Related work

Let $V$ be a set of $D$-dimensional local descriptors extracted from an image, i.e., $V = [v_1, v_2, \cdots, v_M] \in \mathbb{R}^{D \times M}$. Given a dictionary with $P$ entries, $W = [w_1, w_2, \cdots, w_P] \in \mathbb{R}^{D \times P}$, different feature coding schemes convert each descriptor into a $P$-dimensional code to generate the final image representation coefficient matrix $H$, i.e., $H = [h_1, h_2, \cdots, h_M] \in \mathbb{R}^{P \times M}$. Each column of $V$ is a local descriptor and corresponds to one coefficient vector, i.e., one column of $H$.

2.1 Bag of words (BoW)

The BoW representation groups local descriptors. It first generates a dictionary $W$ with $P$ visual words, usually obtained by the K-means clustering algorithm. Each $D$-dimensional local descriptor from an image is then assigned to the closest center. The BoW representation is the histogram of the assignments of all image descriptors to visual words. It is therefore a $P$-dimensional vector whose elements sum to the number of descriptors in the image. However, because the BoW model does not consider the spatial structure of the image and has a large reconstruction error, its image classification ability is limited [6].
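The hard-assignment histogram described above can be sketched in a few lines of Python. The function names `nearest_word` and `bow_histogram` are hypothetical, and the dictionary `W` is assumed to have been learned beforehand (for example by K-means); descriptors and visual words are stored as columns, following the notation of this section.

```python
import numpy as np

def nearest_word(V, W):
    """Index of the nearest visual word for every descriptor.

    V: (D, M) descriptors as columns; W: (D, P) dictionary as columns.
    """
    # Squared Euclidean distances between all pairs, shape (M, P).
    d2 = ((V[:, :, None] - W[:, None, :]) ** 2).sum(axis=0)
    return d2.argmin(axis=1)

def bow_histogram(V, W):
    """P-dimensional count of descriptors assigned to each visual word."""
    P = W.shape[1]
    return np.bincount(nearest_word(V, W), minlength=P)
```

The entries of the returned histogram sum to $M$, the number of descriptors in the image.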

2.2 Sparse coding spatial pyramid matching (ScSPM)

In ScSPM [38], by using sparse coding in place of vector quantization, followed by multi-layer spatial max pooling, the authors developed an extension of the traditional SPM method [21] and presented a linear SPM kernel based on sparse coding of SIFT descriptors. For image classification, ScSPM solves the following optimization problem:

$$\min_{W,H} \sum_{i=1}^{M} \|v_i - W h_i\|_2^2 + \lambda \|h_i\|_1 \tag{1}$$

where $\|\cdot\|_2$ denotes the L2 norm of a vector, i.e., the square root of the sum of the squares of its entries, and $\|\cdot\|_1$ is the L1 norm, i.e., the sum of the absolute values of its entries. The parameter $\lambda$ controls the sparsity of the solution of formula (1): the larger $\lambda$ is, the sparser the solution will be. Experimental results in [38] demonstrated that linear SPM based on sparse coding of SIFT descriptors significantly outperformed the linear SPM kernel on histograms and was even better than the nonlinear SPM kernels. Nevertheless, using sparse coding to learn the dictionary and to encode features is time-consuming, especially for large-scale image datasets or large dictionaries.

2.3 Locality-constrained linear coding (LLC)

In LLC [36], inspired by [40], which argued that locality is more important than sparsity, the authors generalized sparse coding to locality-constrained linear coding, replacing the sparsity constraint in formula (1) with a locality constraint. LLC solves the following optimization problem:

$$\min_{H} \sum_{i=1}^{M} \|v_i - W h_i\|_2^2 + \lambda \|d_i \odot h_i\|_2^2 \quad \text{s.t. } \mathbf{1}^T h_i = 1, \ \forall i \tag{2}$$

where $\mathbf{1} = (1, 1, \cdots, 1)^T$, $\odot$ denotes element-wise multiplication, and $d_i \in \mathbb{R}^P$ is a weight vector. In addition, each coefficient vector $h_i$ is normalized so that $\mathbf{1}^T h_i = 1$. Experimental results in [36] showed that LLC outperformed ScSPM on some benchmark datasets owing to its attractive properties, i.e., better reconstruction, local smooth sparsity and an analytical solution.
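To make the "analytical solution" of formula (2) concrete, here is a minimal Python sketch for a single descriptor. It relies on the fact that, under the constraint $\mathbf{1}^T h = 1$, the residual $v - Wh$ equals $-(W - v\mathbf{1}^T)h$, so the objective becomes a quadratic form whose constrained minimizer is obtained from one linear solve. The function name `llc_code` and the default value of `lam` are assumptions for illustration; the weight vector `d` must be supplied by the caller, since its definition is not given in this section.

```python
import numpy as np

def llc_code(v, W, d, lam=1e-4):
    """Minimise ||v - W h||^2 + lam * ||d * h||^2 subject to sum(h) = 1.

    v: (D,) descriptor; W: (D, P) dictionary; d: (P,) positive weights.
    Illustrative sketch derived directly from formula (2).
    """
    P = W.shape[1]
    Z = W - v[:, None]             # with sum(h) = 1, W h - v = Z h
    C = Z.T @ Z                    # (P, P) quadratic form of the data term
    A = C + lam * np.diag(d ** 2)  # add the locality penalty
    h = np.linalg.solve(A, np.ones(P))
    return h / h.sum()             # rescale so that the constraint holds
```

One possible choice for `d` in experiments is the Euclidean distance between the descriptor and each visual word, which penalizes coefficients on distant words; this choice is only an assumption here.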

2.4 Vector of locally aggregated descriptors (VLAD)

The VLAD representation was proposed in [16] for image retrieval. Let $V = [v_1, v_2, \cdots, v_M] \in \mathbb{R}^{D \times M}$ be the descriptor set extracted from an image. As in BoW, a dictionary $W = [w_1, w_2, \cdots, w_P] \in \mathbb{R}^{D \times P}$ is first learned using K-means. Then, for each local descriptor $v_i$, we look for its nearest-neighbor visual word $NN(v_i)$ in the dictionary. Finally, for each visual word $w_j$, the differences $v_i - w_j$ of the vectors $v_i$ assigned to $w_j$ are accumulated. $C = [c_1^T, c_2^T, \cdots, c_P^T]^T \in \mathbb{R}^{PD}$ ($c_j \in \mathbb{R}^D$, $j = 1, 2, \cdots, P$) is the final VLAD vector representation, obtained according to the following formula:

$$c_j = \sum_{v_i : NN(v_i) = w_j} (v_i - w_j) \tag{3}$$

The VLAD representation is the concatenation of the $D$-dimensional vectors $c_j$ and is therefore $PD$-dimensional, where $P$ is the dictionary size. Algorithm 1 gives the VLAD coding process. Like the Fisher vector, the VLAD can then be power-normalized and L2-normalized in sequence, where the power-normalization parameter $\alpha$ is empirically set to 0.5. It is worth noting that the VLAD coding algorithm involves no SPM or pooling step. Existing experiments have shown that VLAD is an efficient feature coding method for small dictionary sizes.
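Formula (3) and the two normalization steps can be sketched in Python as follows. This is an illustrative reading of the text rather than a reproduction of Algorithm 1: the function name `vlad` is hypothetical, and power normalization is assumed to be the signed power $\mathrm{sign}(z)|z|^{\alpha}$ applied to every component.

```python
import numpy as np

def vlad(V, W, alpha=0.5):
    """VLAD code of one image (sketch of formula (3) plus normalisation).

    V: (D, M) descriptors as columns; W: (D, P) dictionary as columns.
    Returns a vector of length P * D.
    """
    D, P = W.shape
    # Nearest visual word of every descriptor.
    d2 = ((V[:, :, None] - W[:, None, :]) ** 2).sum(axis=0)
    nn = d2.argmin(axis=1)
    C = np.zeros((P, D))
    for j in range(P):
        assigned = V[:, nn == j]
        if assigned.shape[1] > 0:
            # Accumulate residuals v_i - w_j for this visual word.
            C[j] = (assigned - W[:, [j]]).sum(axis=1)
    c = C.ravel()
    c = np.sign(c) * np.abs(c) ** alpha  # power normalisation (assumed signed)
    norm = np.linalg.norm(c)
    return c / norm if norm > 0 else c   # L2 normalisation
```

For a toy one-dimensional case with visual words 0 and 10 and descriptors 1, 2 and 9, the raw residual sums are 3 and -1; after both normalization steps the code is approximately (0.866, -0.5).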

3 Improved VLAD

In this section, three improved VLAD methods are presented, named VLAD based on BoW, Magnified VLAD and Two Nearest Neighbor VLAD, respectively. As with VLAD, the improved VLAD representations can also be power- and L2-normalized, where the parameter $\alpha$ is empirically set to 0.5.

3.1 VLAD based on BoW

Inspired by BoW, we count the number of descriptors belonging to each cluster $w_j$ ($j = 1, \cdots, P$) and add it to VLAD. This improved method is called VLAD based on BoW (abbreviated as VLAD+BoW). The dimensionality of the VLAD+BoW representation is therefore $P(D + 1)$, where the one extra dimension per visual word stores the BoW count. By integrating the histogram information of visual words into VLAD, we expect VLAD+BoW to possess the characteristics of BoW and to improve classification performance. VLAD+BoW is presented in Algorithm 2.
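A possible reading of VLAD+BoW is sketched below in Python. Since Algorithm 2 is not reproduced in this text, the layout is an assumption: the count of each visual word is appended directly after that word's $D$-dimensional residual sum, and normalization is left out for brevity. The function name `vlad_bow` is hypothetical.

```python
import numpy as np

def vlad_bow(V, W):
    """Unnormalised VLAD+BoW code of length P * (D + 1) (illustrative sketch).

    V: (D, M) descriptors as columns; W: (D, P) dictionary as columns.
    """
    D, P = W.shape
    d2 = ((V[:, :, None] - W[:, None, :]) ** 2).sum(axis=0)
    nn = d2.argmin(axis=1)
    out = np.zeros((P, D + 1))
    for j in range(P):
        mask = nn == j
        # Residual sum for word j (zeros when no descriptor is assigned).
        out[j, :D] = (V[:, mask] - W[:, [j]]).sum(axis=1)
        # Extra dimension: number of descriptors assigned to word j.
        out[j, D] = mask.sum()
    return out.ravel()
```

The same power and L2 normalization used for VLAD could then be applied to the returned vector.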

3.2 Magnified VLAD

In order to magnify the impact of residuals, squared residuals are taken into account. This improved version is called Magnified VLAD (abbreviated as MVLAD), and its dimension is $2PD$. The computation of MVLAD is given in Algorithm 3.
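The following Python sketch shows one way to obtain a $2PD$-dimensional code of this kind. Because Algorithm 3 is not reproduced in this text, two points are assumptions: the squared residuals are taken element-wise and summed per visual word, and they are stored directly after that word's ordinary residual sum. The function name `mvlad` is hypothetical, and normalization is omitted.

```python
import numpy as np

def mvlad(V, W):
    """Unnormalised MVLAD-style code of length 2 * P * D (illustrative sketch).

    V: (D, M) descriptors as columns; W: (D, P) dictionary as columns.
    """
    D, P = W.shape
    d2 = ((V[:, :, None] - W[:, None, :]) ** 2).sum(axis=0)
    nn = d2.argmin(axis=1)
    out = np.zeros((P, 2 * D))
    for j in range(P):
        r = V[:, nn == j] - W[:, [j]]      # residuals of word j, shape (D, n_j)
        out[j, :D] = r.sum(axis=1)         # first-order part, as in VLAD
        out[j, D:] = (r ** 2).sum(axis=1)  # element-wise squared residuals (assumed)
    return out.ravel()
```

In the toy case with visual words 0 and 10 and descriptors 1, 2 and 9, the first word receives residual sum 3 and squared-residual sum 5, and the second word receives -1 and 1.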


3.3 Two nearest neighbor VLAD

In addition to the nearest-neighbor center, we seek a second nearest-neighbor center for each descriptor. This process is referred to as two nearest neighbor VLAD (abbreviated as TNNVLAD). The dimension of the TNNVLAD representation is still $PD$. TNNVLAD is a kind of soft coding method and can reduce the representation error. The details are shown in Algorithm 4. If $d_1 > \beta d_2$, half of the difference between $v_i$ and each of its two nearest-neighbor centers is accumulated. The value of $\beta$ is determined experimentally.
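The rule stated above can be sketched in Python as follows. Algorithm 4 is not reproduced in this text, so several points are assumptions: $d_1$ and $d_2$ are taken to be the Euclidean distances from $v_i$ to its nearest and second nearest visual words, a descriptor that does not satisfy $d_1 > \beta d_2$ is assumed to be aggregated as in the original VLAD, and normalization is omitted. The function name `tnn_vlad` is hypothetical.

```python
import numpy as np

def tnn_vlad(V, W, beta=0.8):
    """Unnormalised TNNVLAD-style code of length P * D (illustrative sketch).

    V: (D, M) descriptors as columns; W: (D, P) dictionary, P >= 2.
    """
    D, P = W.shape
    dist = np.sqrt(((V[:, :, None] - W[:, None, :]) ** 2).sum(axis=0))  # (M, P)
    order = np.argsort(dist, axis=1)
    C = np.zeros((P, D))
    for i in range(V.shape[1]):
        j1, j2 = order[i, 0], order[i, 1]    # nearest and second nearest words
        d1, d2 = dist[i, j1], dist[i, j2]
        v = V[:, i]
        if d1 > beta * d2:
            # The two distances are comparable: share the descriptor.
            C[j1] += 0.5 * (v - W[:, j1])
            C[j2] += 0.5 * (v - W[:, j2])
        else:
            # Clear winner: aggregate as in the original VLAD (assumed).
            C[j1] += v - W[:, j1]
    return C.ravel()
```

Under these assumptions, setting `beta=1.0` disables sharing, because $d_1 > d_2$ never holds, and the result coincides with the unnormalized VLAD.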

4 Experimental results

This section begins with a description of our experimental setting, followed by comparisons between our schemes and other prominent methods on three datasets, i.e., UIUC Sports Event, Corel 10 and 15 Scenes. Figure 1 shows example images from these datasets.


4.1 Experimental setting

Fig. 1 Image examples of the datasets UIUC Sports Event (the left four), Corel 10 (the middle four), and 15 Scenes (the right four)

A typical experimental setting for classifying images contains four main steps. First, we adopt the widely used SIFT descriptor [25] because of its good performance in image classification reported in [12, 21, 36, 38]. Specifically, SIFT features are invariant to image scale and rotation and robust across a substantial range of affine distortion, addition of noise, and change in illumination. To be consistent with previous work, we use the same setting to extract SIFT descriptors: 128-dimensional SIFT descriptors are densely extracted from image patches on a grid with a step size of 8 pixels and a single patch size of 16 × 16. We resize the maximum side (i.e., length or width) of each image to 300 pixels, except for the UIUC Sports Event dataset, where we resize the maximum side to 400 pixels because of the high resolution of the original images. Second, about twenty descriptors from each image are chosen at random to form a new matrix, which is taken as the input of the K-means clustering or sparse coding algorithm to learn a dictionary of a specified size. Third, we use the BoW, sparse coding, LLC, VLAD and improved VLAD schemes to encode the descriptors and produce each image's new representation. For the BoW model, the dimensionality of the new representation is the dictionary size $P$. For ScSPM and LLC, we combine a three-layer spatial pyramid matching model (21 subregions) with the max pooling function, so the dimension of the new representation is $21P$. The dimensionality of the VLAD and improved VLAD representations can be found in Algorithms 1-4. In the final step, we apply a linear SVM classifier to the new representations, randomly selecting some columns per class for training and some other columns per class for testing. The classification accuracy for each category is obtained by comparing the predicted labels of the test set with its ground-truth labels. We then sum the classification accuracies of all categories and divide by the number of categories to obtain the overall classification accuracy. All results are obtained by repeating five independent experiments, and the average classification accuracy and the standard deviation over the five experiments are reported. All experiments are conducted in MATLAB on a server with an Intel X5650 CPU (2.66 GHz, 12 cores) and 32 GB RAM.
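The accuracy measure described above, the mean of the per-category accuracies, differs from the plain fraction of correctly labeled test images when categories have different sizes. A minimal Python sketch follows; the function name `mean_per_class_accuracy` is hypothetical, and the experiments themselves were run in MATLAB.

```python
import numpy as np

def mean_per_class_accuracy(y_true, y_pred):
    """Average of the per-category accuracies (illustrative sketch)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    # Accuracy of each category: fraction of its test samples labelled correctly.
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))
```

For example, with true labels `[0, 0, 0, 1]` and predictions `[0, 0, 1, 1]`, the per-category accuracies are 2/3 and 1, so the reported value is 5/6, whereas the plain accuracy would be 3/4.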

For the TNNVLAD algorithm, Fig. 2 illustrates the selection of the parameter $\beta$ on the three datasets. Specifically, Fig. 2 shows the classification accuracy of TNNVLAD as $\beta$ varies over the interval [0.1, 1] with the dictionary size fixed at 130. The results indicate that $\beta = 0.8$ is the best choice for TNNVLAD. Therefore, we fix $\beta = 0.8$ in the TNNVLAD algorithm in our experiments.

4.2 UIUC sports event dataset

Fig. 2 Classification accuracy of our TNNVLAD algorithm under different β on the UIUC Sports Event, Corel 10 and 15 Scenes datasets (horizontal axis: beta, 0.1 to 1; vertical axis ticks: 75 to 85; curves: UIUC Sports Event, Corel 10, 15 Scenes)

UIUC Sports Event [23] contains 8 categories and 1579 images in total, with the number of images per category ranging from 137 to 250. The 8 categories are badminton, bocce, croquet, polo, rock climbing, rowing, sailing and snowboarding. To compare with other methods, we randomly select 70 images per class as training data and 60 images per class as test data. Fig. 3 compares the classification accuracy of our three improved VLAD schemes with that of four other methods under different dictionary sizes, where the dictionary size ranges from 10 to 420 in steps of 10. From the results in Fig. 3, we notice that the classification accuracy of our methods surpasses that of all other algorithms when the dictionary size is small and is comparable to the existing schemes when the dictionary size becomes large. This may be explained by the fact that VLAD is designed to aggregate local image descriptors into compact codes, so it performs well with a small dictionary. Besides, Fig. 3 shows that BoW performs worst, ScSPM is better than BoW, and LLC is better still than ScSPM; these observations are consistent with reports in the existing literature.

Fig. 3 Classification accuracy comparisons of various coding methods under different dictionary size on the UIUC Sports Event dataset (horizontal axis: Dictionary Size, 50 to 400; vertical axis ticks: 50 to 90; curves: BoW, ScSPM, LLC, VLAD, VLAD+BoW, MVLAD, TNNVLAD)

Based on Fig. 3, we list the best classification accuracy of the various approaches in Table 1, where the average classification accuracy, standard deviation and corresponding dictionary size are given. From Table 1, we can conclude that the best classification accuracies of our three improved methods are higher than those of the other four schemes on the UIUC Sports Event dataset. Our VLAD+BoW and TNNVLAD methods achieve more than 1 % higher accuracy than LLC, which is a state-of-the-art method based on the SPM model. Furthermore, the original VLAD and the improved VLAD methods reach their best classification accuracy with a smaller dictionary (210 to 220), whereas BoW, ScSPM and LLC need a larger dictionary (330 to 400) to reach their highest accuracy.

Table 1 The best classification accuracy comparisons on the UIUC Sports Event dataset (mean ± std-dev) %

| Algorithm | Classification accuracy (dictionary size) |
|---|---|
| BoW [6] | 73.38 ± 0.85 (390) |
| ScSPM [38] | 83.71 ± 2.20 (400) |
| LLC [36] | 84.17 ± 1.36 (330) |
| VLAD [17] | 84.38 ± 2.67 (220) |
| VLAD+BoW | 85.29 ± 0.87 (210) |
| MVLAD | 84.75 ± 1.85 (220) |
| TNNVLAD | 85.25 ± 1.26 (220) |

Fig. 4 Confusion matrices of our algorithms on UIUC Sports Event dataset (three panels: "The Confusion Matrix of VLAD+BoW algorithm on UIUC Sports Event (%)", "The Confusion Matrix of MVLAD algorithm on UIUC Sports Event (%)", "The Confusion Matrix of TNNVLAD algorithm on UIUC Sports Event (%)"; both axes list the classes badminton, bocce, croquet, polo, rockclimbing, rowing, sailing, snowboarding)

Moreover, the confusion matrices of our algorithms on the UIUC Sports Event dataset are shown in Fig. 4. When computing the confusion matrices, the dictionary size is set to 130 for our three improved VLAD methods. In each confusion matrix, the element in the $i$th row and $j$th column ($i \neq j$) is the percentage of images from class $i$ that are misclassified as class $j$. The average classification accuracies of the individual classes over five independent experiments are listed along the main diagonal. Figure 4 thus shows the classification and misclassification status of each class. Our algorithms perform well for the classes badminton and rock climbing. We also notice that the classes bocce and croquet have a high percentage of misclassified images, which may be because they are visually similar to each other: the balls in bocce and croquet have a very similar appearance.
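A row-normalized confusion matrix of the kind described above can be computed as in the following Python sketch. The function name `confusion_matrix_percent` is hypothetical, class labels are assumed to be integers from 0 to `num_classes - 1`, and averaging over the five runs is left to the caller.

```python
import numpy as np

def confusion_matrix_percent(y_true, y_pred, num_classes):
    """Confusion matrix in percent; row i describes images of true class i."""
    counts = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    # Divide each row by the number of test images of that class.
    row_sums = counts.sum(axis=1, keepdims=True)
    return 100.0 * counts / np.maximum(row_sums, 1)
```

Each non-empty row sums to 100, the diagonal holds the per-class accuracy, and the off-diagonal entry in row $i$, column $j$ is the percentage of class $i$ images labeled as class $j$.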

To further demonstrate the advantage of our methods in running speed, the computation time of the various approaches for different dictionary sizes on the UIUC Sports Event dataset is reported in Fig. 5. The computation time of each method is the total time of five independent experiments, in seconds. From Fig. 5, we can see that BoW is the fastest method because of its low-dimensional representation. We also observe that ScSPM is the slowest, because sparse coding is used both to learn the dictionary and to encode features, and solving the L1-norm minimization problem is very time-consuming. The computation times of VLAD and our three improved VLAD methods are smaller than that of LLC. These experimental results show that our algorithms have a certain advantage in computation time.

Fig. 5 Computation time comparisons of various coding methods under different dictionary size on the UIUC Sports Event dataset (horizontal axis: Dictionary Size, 50 to 400; vertical axis: Computation Time (seconds), 0 to 8000; curves: BoW, ScSPM, LLC, VLAD, VLAD+BoW, MVLAD, TNNVLAD)

Fig. 6 Classification accuracy comparisons of various coding methods under different dictionary size on the Corel 10 dataset (horizontal axis: Dictionary Size, 50 to 400; vertical axis ticks: 35 to 85; curves: BoW, ScSPM, LLC, VLAD, VLAD+BoW, MVLAD, TNNVLAD)

4.3 Corel 10 dataset

Corel 10 [26] contains 10 categories with 100 images per category. The categories are beach, buildings, elephants, flowers, food, horses, mountains, owls, skiing and tigers. Following the setting of [12, 26], we randomly select 50 images from each class as training data and use the remaining 50 images per class as test data. Classification accuracy comparisons of the various coding methods under different dictionary sizes on the Corel 10 dataset are shown in Fig. 6. We again see that our improved VLAD algorithms obtain good performance when the dictionary size is small.
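The evaluation protocol above (a random per-class split, repeated over independent runs and summarized as mean ± std-dev) can be sketched as follows. The helper names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is reported.

```python
import random
import statistics


def split_per_class(labels, train_per_class, rng):
    """Randomly pick `train_per_class` indices of each class for training.

    Returns (train_indices, test_indices); the remaining images of
    every class go to the test set.
    """
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        train.extend(indices[:train_per_class])
        test.extend(indices[train_per_class:])
    return train, test


def summarise_runs(accuracies):
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

For Corel 10, `split_per_class(labels, 50, random.Random(seed))` with a different seed per run would reproduce the 50/50 split per class; the 15 Scenes setting corresponds to `train_per_class=100`.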

According to Fig. 6, the best classification accuracies of the different algorithms are reported in Table 2. The results show that the best classification accuracies of our three improved VLAD algorithms are higher than those of the other four schemes on the Corel 10

Table 2 Best classification accuracy on the Corel 10 dataset (mean ± std-dev, %)

Algorithm      Classification accuracy   Dictionary size
BoW [6]        67.44 ± 0.91              340
ScSPM [38]     75.24 ± 1.24              340
LLC [36]       79.20 ± 1.66              380
VLAD [17]      78.76 ± 1.47              110
VLAD+BoW       79.88 ± 0.48              130
MVLAD          79.96 ± 1.20              280
TNNVLAD        81.32 ± 1.45              130


Fig. 7 Confusion matrices of our algorithms on the Corel 10 dataset (three 10 × 10 panels with values in %: VLAD+BoW, MVLAD and TNNVLAD)


dataset. Moreover, all the VLAD-based algorithms reach their best classification accuracy with a smaller dictionary, whereas BoW, ScSPM and LLC need a large dictionary to reach theirs. Our TNNVLAD method is about two percentage points higher than LLC, the best of the other methods (81.32 % versus 79.20 %).
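The comparison drawn from Table 2 can be checked with a short script. The numbers are copied from Table 2; the variable and function names are illustrative only.

```python
# Best accuracy (%), std-dev and dictionary size, copied from Table 2.
TABLE_2 = {
    "BoW": (67.44, 0.91, 340),
    "ScSPM": (75.24, 1.24, 340),
    "LLC": (79.20, 1.66, 380),
    "VLAD": (78.76, 1.47, 110),
    "VLAD+BoW": (79.88, 0.48, 130),
    "MVLAD": (79.96, 1.20, 280),
    "TNNVLAD": (81.32, 1.45, 130),
}
PROPOSED = ("VLAD+BoW", "MVLAD", "TNNVLAD")


def best_of(names):
    """Name of the method with the highest mean accuracy."""
    return max(names, key=lambda name: TABLE_2[name][0])


best_proposed = best_of(PROPOSED)
best_baseline = best_of([n for n in TABLE_2 if n not in PROPOSED])
# Gap in percentage points between the two best methods.
margin = round(TABLE_2[best_proposed][0] - TABLE_2[best_baseline][0], 2)
print(best_proposed, best_baseline, margin)  # TNNVLAD LLC 2.12
```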

The confusion matrices for the Corel 10 dataset are given in Fig. 7. Our algorithms perform well on the flowers and horses classes and poorly on the mountains class.

Figure 8 compares the computation time of the various coding methods under different dictionary sizes on the Corel 10 dataset. ScSPM requires more time than the other six algorithms. Although MVLAD needs more time than BoW and LLC, it is still far faster than ScSPM.

4.4 15 Scenes dataset

The 15 Scenes dataset [21] contains 15 categories and 4485 images in total, with the number of images in each category ranging from 200 to 400. The 15 categories are bedroom, suburb, industrial, kitchen, living room, coast, forest, highway, inside city, mountain, open country, street, tall building, office and store. The image content is diverse, containing not only indoor scenes, such as living room and store, but also outdoor scenes, such as coast and forest. To allow comparison with other methods, we randomly select 100 images per class as training data and use the rest as test data. Figure 9 compares the classification accuracy of the various coding methods under different dictionary sizes on the 15 Scenes dataset. The VLAD-based algorithms perform better than ScSPM and LLC when the dictionary size is small, but fall slightly below LLC as the dictionary size increases.

Fig. 8 Computation time comparisons of various coding methods under different dictionary sizes on the Corel 10 dataset (x-axis: dictionary size, 50 to 400; y-axis: computation time in seconds, 0 to 3500; curves: BoW, ScSPM, LLC, VLAD, VLAD+BoW, MVLAD, TNNVLAD)


Fig. 9 Classification accuracy comparisons of various coding methods under different dictionary sizes on the 15 Scenes dataset (x-axis: dictionary size, 50 to 400; y-axis range: 45 to 85; curves: BoW, ScSPM, LLC, VLAD, VLAD+BoW, MVLAD, TNNVLAD)

Based on the data in Fig. 9, the best classification accuracies are presented in Table 3. On the 15 Scenes dataset, the best performance of our improved VLAD algorithms is comparable with or slightly lower than that of LLC and ScSPM.

The confusion matrices for the 15 Scenes dataset are shown in Fig. 10. Our algorithms perform well on the suburb and forest classes. The bedroom and living room classes have a high misclassification rate, as do the kitchen and living room classes, which may be because these classes are visually similar to each other.

Figure 11 reports the computation time of the various coding methods under different dictionary sizes on the 15 Scenes dataset. ScSPM requires more time than the other six algorithms.

Table 3 Best classification accuracy on the 15 Scenes dataset (mean ± std-dev, %)

Algorithm      Classification accuracy   Dictionary size
BoW [6]        65.87 ± 0.61              90
ScSPM [38]     79.54 ± 0.70              420
LLC [36]       80.85 ± 1.02              420
VLAD [17]      77.35 ± 0.50              400
VLAD+BoW       80.09 ± 0.51              280
MVLAD          78.82 ± 0.50              280
TNNVLAD        79.23 ± 0.62              400


Fig. 10 Confusion matrices of our algorithms on the 15 Scenes dataset (three 15 × 15 panels with values in %: VLAD+BoW, MVLAD and TNNVLAD)


Fig. 11 Computation time comparisons of various coding methods under different dictionary sizes on the 15 Scenes dataset (x-axis: dictionary size, 50 to 400; y-axis: computation time in seconds, 0 to 12000; curves: BoW, ScSPM, LLC, VLAD, VLAD+BoW, MVLAD, TNNVLAD)

5 Conclusion and future work

In this paper, three feature coding schemes based on VLAD are proposed for image classification. We compare our schemes with several state-of-the-art methods, including BoW, ScSPM, LLC and VLAD. Experiments on three datasets (UIUC Sports Event, Corel 10 and 15 Scenes) demonstrate that the classification accuracy of our improved VLAD coding strategies is better than that of the four classical methods under small dictionary sizes. It is also noteworthy that our schemes are much faster than ScSPM, because ScSPM needs more time to learn the dictionary and to encode features with sparse coding. In many applications, classification accuracy and classification speed must be considered simultaneously. In the future, we will try to find more efficient feature coding strategies and apply them to large-scale image datasets.
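To make two of the three ideas concrete, the sketch below shows one possible reading of them in Python: appending a BoW-style count to each residual block, and aggregating each descriptor over its two nearest visual words. It follows only the informal descriptions in the abstract; the feature layout, the function names and the absence of any weighting or normalization are assumptions of this example, not the authors' exact formulation, and the squared-residual variant is not covered.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_words(descriptor, dictionary, how_many):
    """Indices of the `how_many` closest visual words, nearest first."""
    order = sorted(range(len(dictionary)),
                   key=lambda k: squared_distance(descriptor, dictionary[k]))
    return order[:how_many]


def residual_blocks(descriptors, dictionary, neighbours=1):
    """Per-word residual sums and per-word descriptor counts.

    neighbours=1 gives the plain VLAD residuals; neighbours=2 adds each
    descriptor's residual to its two nearest words (assumed reading of
    the two-nearest-neighbour idea).
    """
    dim = len(dictionary[0])
    blocks = [[0.0] * dim for _ in dictionary]
    counts = [0] * len(dictionary)
    for x in descriptors:
        nearest = nearest_words(x, dictionary, neighbours)
        counts[nearest[0]] += 1  # the count uses the closest word only
        for k in nearest:
            for t in range(dim):
                blocks[k][t] += x[t] - dictionary[k][t]
    return blocks, counts


def vlad_plus_counts(descriptors, dictionary):
    """Hypothetical layout: each residual block followed by its count."""
    blocks, counts = residual_blocks(descriptors, dictionary, neighbours=1)
    feature = []
    for block, count in zip(blocks, counts):
        feature.extend(block + [float(count)])
    return feature
```

Both variants keep the cost of a nearest-word search per descriptor, which is consistent with the speed advantage over sparse coding noted above.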

Acknowledgments This work is sponsored by NUPTSF (Grant No. NY214168), National Natural Science Foundation of China (Grant No. 61300164, 61272247), Shanghai Science and Technology Committee (Grant No. 13511500200) and European Union Seventh Framework Programme (Grant No. 247619).

References

1. Arandjelovic R, Zisserman A (2013) All about vlad. In: IEEE conference on computer vision and pattern recognition, pp 1578–1585

2. Boiman O, Shechtman E, Irani M (2008) In defense of nearest-neighbor based image classification. In: IEEE conference on computer vision and pattern recognition, pp 1–8

3. Bosch A, Zisserman A, Muñoz X (2008) Scene classification using a hybrid generative/discriminative approach. IEEE Trans Pattern Anal Mach Int 30(4):712–727

4. Cinbis RG, Verbeek J, Schmid C (2012) Image categorization using fisher kernels of non-iid image models. In: IEEE conference on computer vision and pattern recognition, pp 2184–2191

5. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297


6. Csurka G, Dance CR, Fan LX, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Workshop on statistical learning in computer vision, ECCV, vol 1, p 22

7. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: IEEE conference on computer vision and pattern recognition, vol 1, pp 886–893

8. Delhumeau J, Gosselin PH, Jegou H, Perez P (2013) Revisiting the vlad image representation. In: ACM international conference on Multimedia, pp 653–656

9. Elad M, Aharon M (2006) Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans Image Proc 15(12):3736–3745

10. Fei-Fei L, Fergus R, Perona P (2007) Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Comp Vision Image Underst 106(1):59–70

11. Freund Y, Schapire R (1995) A desicion-theoretic generalization of on-line learning and an application to boosting. In: Computational learning theory, pp 23–37

12. Gao SH, Tsang IWH, Chia LT, Zhao PL (2010) Local features are not lonely–laplacian sparse coding for image classification. In: IEEE conference on computer vision and pattern recognition, pp 3555–3561

13. Grauman K, Darrell T (2005) The pyramid match kernel: Discriminative classification with sets of image features. In: International conference on computer vision, vol 2, pp 1458–1465

14. Griffin G, Holub A, Perona P (2007) Caltech-256 object category dataset

15. Harada T, Ushiku Y, Yamashita Y, Kuniyoshi Y (2011) Discriminative spatial pyramid. In: IEEE conference on computer vision and pattern recognition, pp 1617–1624

16. Jegou H, Douze M, Schmid C, Perez P (2010) Aggregating local descriptors into a compact image representation. In: IEEE conference on computer vision and pattern recognition, pp 3304–3311

17. Jegou H, Perronnin F, Douze M, Sanchez J, Perez P, Schmid C (2012) Aggregating local image descriptors into compact codes. IEEE Trans Pattern Anal Mach Int 34(9):1704–1716

18. Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. In: International conference on computer vision, vol 1, pp 604–610

19. Krapac J, Verbeek J, Jurie F (2011) Modeling spatial layout with fisher vectors for image categorization. In: IEEE international conference on computer vision, pp 1487–1494

20. Kulkarni N, Li BX (2011) Discriminative affine sparse codes for image classification. In: IEEE conference on computer vision and pattern recognition, pp 1609–1616

21. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: IEEE conference on computer vision and pattern recognition, vol 2, pp 2169–2178

22. Li FF, Pietro P (2005) A bayesian hierarchical model for learning natural scene categories. In: IEEE conference on computer vision and pattern recognition, vol 2, pp 524–531

23. Li LJ, Li FF (2007) What, where and who? Classifying events by scene and object recognition. In: International conference on computer vision, pp 1–8

24. Long X, Lu H, Li W (2012) Image classification based on nearest neighbor basis vectors. Multimed Tools Appl:1–18

25. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110

26. Lu Z, Ip HHS (2009) Image categorization with spatial mismatch kernels. In: IEEE conference on computer vision and pattern recognition, pp 397–404

27. Moosmann F, Triggs B, Jurie F (2007) Fast discriminative visual codebooks using randomized clustering forests. Advances in neural information processing systems 19

28. Morel J, Yu G (2009) Asift: a new framework for fully affine invariant image comparison. SIAM J Imaging Sci 2(2):438–469

29. Perronnin F, Dance C (2007) Fisher kernels on visual vocabularies for image categorization. In: IEEE conference on computer vision and pattern recognition, pp 1–8

30. Perronnin F, Sanchez J, Mensink T (2010) Improving the fisher kernel for large-scale image classification. In: European conference on computer vision, pp 143–156

31. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. In: IEEE conference on computer vision and pattern recognition, pp 1–8

32. Picard D, Gosselin PH (2011) Improving image similarity with vectors of locally aggregated tensors. In: IEEE international conference on image processing, pp 669–672

33. Quelhas P, Monay F, Odobez JM, Gatica-Perez D, Tuytelaars T, Van Gool L (2005) Modeling scenes with local descriptors and latent aspects. In: International conference on computer vision, vol 1, pp 883–890


34. Rublee E, Rabaud V, Konolige K, Bradski G (2011) Orb: An efficient alternative to sift or surf. In: International conference on computer vision

35. Sivic J, Zisserman A (2003) Video google: a text retrieval approach to object matching in videos. In: International conference on computer vision, pp 1470–1477

36. Wang JJ, Yang JC, Yu K, Lv FJ, Huang T, Gong YH (2010) Locality-constrained linear coding for image classification. In: IEEE conference on computer vision and pattern recognition, pp 3360–3367

37. Xu D, Chang S (2008) Video event recognition using kernel methods with multilevel temporal alignment. IEEE Trans Pattern Anal Mach Int 30(11):1985–1997

38. Yang JC, Yu K, Gong YH, Huang T (2009) Linear spatial pyramid matching using sparse coding for image classification. In: IEEE conference on computer vision and pattern recognition, pp 1794–1801

39. Yang L, Jin R, Sukthankar R, Jurie F (2008) Unifying discriminative visual codebook generation with classifier training for object category recognition. In: IEEE conference on computer vision and pattern recognition, pp 1–8

40. Yu K, Zhang T, Gong YH (2009) Nonlinear learning using local coordinate coding. Adv Neural Inf Process Syst 22:2223–2231

41. Zhou X, Yu K, Zhang T, Huang TS (2010) Image classification using super-vector coding of local image descriptors. In: European conference on computer vision, pp 141–154

Xianzhong Long obtained his Ph.D. degree from Shanghai Jiao Tong University in June 2014. He received his B.S. degree from Henan Polytechnic University in 2007 and his M.S. degree from Xihua University in 2010, both in computer science. He is now an assistant professor at Nanjing University of Posts and Telecommunications. His research interests are computer vision, machine learning and image processing, specifically image classification, object recognition and clustering.


Hongtao Lu received his Ph.D. degree in Electronic Engineering from Southeast University, Nanjing, in 1997. After graduation he became a postdoctoral fellow in the Department of Computer Science, Fudan University, Shanghai, China, where he spent two years. In 1999, he joined the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, where he is now a professor. His research interests include machine learning, computer vision and pattern recognition, and information hiding. He has published more than sixty papers in international journals, such as IEEE Transactions and Neural Networks, and in international conferences. His papers have received more than 400 citations.

Yong Peng received the B.S. degree in computer science from Hefei New Star Research Institute of Applied Technology and the M.S. degree from the Graduate University of Chinese Academy of Sciences. He is now working towards his Ph.D. degree at Shanghai Jiao Tong University. His research interests include machine learning, pattern recognition and evolutionary computation.


Xianzhong Wang received the B.S. degree in computer science from Anhui University of Technology. He is now a Master's candidate in the Computer Science and Engineering Department of Shanghai Jiao Tong University. His research interests include machine learning and human action recognition.

Shaokun Feng received the B.S. degree in information science from University of Shanghai for Science and Technology. He is now working towards his M.S. degree at Shanghai Jiao Tong University. His research interests include machine learning, pattern recognition and deep learning.