5324 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 26, NO. 11, NOVEMBER 2017

Distributed Adaptive Binary Quantization for Fast Nearest Neighbor Search

Xianglong Liu, Member, IEEE, Zhujin Li, Cheng Deng, Member, IEEE, and Dacheng Tao, Fellow, IEEE

Abstract— Hashing has been proved an attractive technique for fast nearest neighbor search over big data. Compared with the projection based hashing methods, prototype-based ones own stronger power to generate discriminative binary codes for data with complex intrinsic structure. However, existing prototype-based methods, such as spherical hashing and K-means hashing, still suffer from ineffective coding that utilizes the complete binary codes in a hypercube. To address this problem, we propose an adaptive binary quantization (ABQ) method that learns a discriminative hash function with prototypes associated with small unique binary codes. Our alternating optimization adaptively discovers the prototype set and a code set of varying size in an efficient way, which together robustly approximate the data relations. Our method can be naturally generalized to the product space for long hash codes, and enjoys fast training linear in the number of training data. We further devise a distributed framework for large-scale learning, which can significantly speed up the training of ABQ in the distributed environments that have been widely deployed in many areas nowadays. Extensive experiments on four large-scale (up to 80 million) data sets demonstrate that our method significantly outperforms state-of-the-art hashing methods, with up to 58.84% relative performance gains.

Index Terms— Locality-sensitive hashing, nearest neighbor search, binary quantization, distributed learning, product quantization.

I. INTRODUCTION

THE past decades have witnessed the success of the hashing technique for fast nearest neighbor search in many applications, such as large-scale visual search [1]–[7], machine learning [8]–[11], data mining [12], etc. The hashing methods basically encode data into (binary) hash codes, which turns the nearest neighbor search in the original complex space into that in the Hamming space. Therefore, the storage required for a gigantic database will be significantly compressed, and searching over the database can be completed in a fast way based on the codes.

Manuscript received November 5, 2016; revised June 1, 2017; accepted July 2, 2017. Date of publication July 24, 2017; date of current version August 21, 2017. This work was supported in part by the National Natural Science Foundation of China under Grant 61370125 and Grant 61402026, in part by the Beijing Municipal Science and Technology Commission under Grant Z171100000117022, in part by the Foundation of State Key Laboratory of Software Development Environment under Grant SKLSDE-2016ZX-04, and in part by the Foundation of Shaanxi Key Industrial Innovation Chain under Grant 2017ZDCXL-GY-05-04-02. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Aydin Alatan. (Corresponding author: Cheng Deng.)

X. Liu and Z. Li are with the State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China (e-mail: [email protected]).

C. Deng is with the School of Electronic Engineering, Xidian University, Xi'an 710071, China (e-mail: [email protected]).

D. Tao is with the School of Information Technologies, The University of Sydney, NSW 2007, Australia (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2017.2729896


To guarantee the search performance, intuitively the hashing methods should assign the nearest neighbors to adjacent hash codes, i.e., preserve the neighbor relations among data. To achieve this goal, Locality-Sensitive Hashing (LSH) was first introduced in [13] as the most essential concept in hashing based nearest neighbor search with a probability guarantee. The pioneer LSH research adopted a random projection paradigm for metrics like the ℓp-norm (p ∈ (0, 2]) [14]. Due to its simple form and efficient computation, such a projection based solution has become the most widely accepted hashing paradigm, where the data point is first projected along a certain direction and further quantized to a binary value.

LSH randomly generates the projection vectors without considering the intrinsic relationships among the data, and thus it usually suffers from heavy redundancy among the hash bits. This fact inevitably leads to insufficient discriminative power for nearest neighbors. To utilize the intrinsic distribution information in the data, many following studies have attempted to learn projection based hash functions from the training data using a number of different techniques, including supervised learning [15]–[18], nonlinear mapping [19]–[21], discrete optimization [22]–[26], structural information embedding [4], [27]–[29], multiple view fusion [2], [30]–[32], complementary table learning [33], [34], etc.

Although projection based hashing has been proved beneficial to the generation of discriminative hash codes in many tasks [19], [23], [24], [27], [35]–[42], state-of-the-art methods still cannot well approximate the nearest neighbor relations using their binary codes, mainly due to the fact that the linear form hardly characterizes the complex inherent structures among data [43]. To alleviate the problem, nonlinear mapping techniques, based on kernels [20], [44]–[46] or deep learning [47], [48], are further exploited to uplift the data into an informative space. These techniques have been proved very helpful for generating discriminative hash codes. However, they are usually time-consuming and hardly discover the underlying data structures using the binary quantization. Besides, the deep learning based hashing methods, relying on specific structure in the data (e.g., the spatial structure in image data), usually work well in end-to-end tasks like image search [47], [48], rather than the basic nearest neighbor search for vectorial data.

1057-7149 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 1. The geometric view of the binary quantization using different methods on a subset of SIFT-1M (projected into 3-dimensional space using PCA). The quantization loss is computed according to (3). (a) ITQ. (b) KMH. (c) ABQ (ours).

As the most well-known technique that can well model the complex relationships among data, clustering has been proved a powerful quantization solution using a number of prototypes. The recent hashing studies have attempted to employ clustering in the binary quantization, including spherical hashing (SPH) [49] and K-means hashing (KMH) [43]. These hashing methods explicitly pursued a number of prototypes to approximate the data relations, and adopted different coding schemes to quantize the data samples based on these prototypes. In SPH, each bit value of the binary code is generated separately using each cluster-based hash function, without directly utilizing the affinity structure of all clusters in the original feature space. KMH attempted to simultaneously discover clusters in the original feature space and assign them distinct binary codes preserving their distances in Hamming space. Different from projection based hashing where each hash function is parameterized by projection vectors, these prototype based hashing methods define the hash functions based on the discovered prototypes, and promisingly increase the search performance with much less quantization loss.

Existing prototype based hashing methods like KMH make use of the complete binary code set, which geometrically forms a hypercube with a fixed dimension and structure among the codes, i.e., 2^b vertices for b-length hash codes and constant distances among them. In practice, real-world data usually distribute with a complex structure, which can hardly be characterized by such a hypercube with a fixed structure. A demonstrative case on a subset of the SIFT-1M dataset is presented in Figure 1, where KMH, as a typical prototype based hashing method, outperforms the state-of-the-art projection based ITQ (see Figure 1 (a) and (b)); however, using 3-bit codes the Hamming distances among the prototypes (marked by stars with different colors) cannot well approximate the original ones (the hypercube is skewed in Figure 1(b)).

In fact, a better coding solution relying only on a small set of the vertices in the hypercube (instead of the complete one) can largely reduce the quantization loss by aligning them with the true distribution of the data samples (see Figure 1 (c)). This is because the incomplete coding can match the data distribution in a more reasonable way, and thus better approximate neighbor relations. Motivated by this observation, an adaptive binary quantization (ABQ) method is proposed to pursue a discriminative hash function with a varying number of prototypes, each of which is associated with a unique and compact binary code. Furthermore, a distributed learning framework for ABQ is designed to support fast training over large-scale datasets in the distributed environment. In the proposed ABQ, the prototype set and codes are jointly discovered to respectively characterize the data distribution in the original space and align the code space to the prototype distribution. Therefore, the learnt prototype based hash function can promise discriminative binary codes that largely approximate the neighbor structures. We further apply product quantization to generalize our method for long hash codes. Experimental results over four large-scale datasets demonstrate that the proposed method significantly outperforms state-of-the-art hashing methods.

Note that the whole paper extends a previous conference publication [50], which mainly concentrated on centralized learning in a single node. In this paper, we introduce a powerful learning framework that can simultaneously exploit both distributed and multi-thread parallel techniques for fast training. Besides, we give expanded analysis and experimental results. The remaining sections are organized as follows: The prototype based binary quantization approach is presented in Section II. Section III elaborates on the alternating optimization of the proposed ABQ. Then in Section IV we present the distributed learning framework that can boost the training significantly. In Section V we evaluate the proposed method against state-of-the-art methods over several large datasets. Finally, we conclude in Section VI.

II. PROTOTYPE BASED BINARY QUANTIZATION

This section will present our proposed adaptive binary quantization method (denoted as ABQ hereafter) in detail.

A. Hash Function With Prototypes

In particular, supposing we have a set of n training samples, we denote x_i ∈ R^{d×1}, i = 1, . . . , n, to be the i-th sample of d dimensions. Let X = [x_1, x_2, . . . , x_n] ∈ R^{d×n} be the data matrix. Our goal is to learn an informative binary hash function that encodes the training data X into b-length hash codes Y = [y_1, y_2, . . . , y_n] ∈ {−1, 1}^{b×n}, which show sensitivity to the neighbor structures of the data.

The literature has proved that the prototype-based representation shows robustness to the more general metric structure for data in high dimensional space. Therefore, in order to capture the neighbor structure, our desired hash function should fully utilize the representative prototypes, and the generated hash codes should preserve the relations among the prototypes. To meet this goal, one simple yet powerful way is to assign each prototype a unique binary code, maintaining the geometric structure among the prototypes.

Following this idea, in our work a set of prototypes P = {p_k | p_k ∈ R^d} is discovered first among the large training set, and then each prototype is associated with a b-bit binary code c_k ∈ {−1, +1}^b, forming a binary codebook C. Namely, our hash function h(x) based on the prototypes can be defined as follows:

h(x) = c_{i*(x)}.   (1)

The hash function actually works as follows: for any data point x, it is first represented by its nearest prototype p_{i*(x)} according to the specific distance function d_o(·, ·):

i*(x) = arg min_k d_o(x, p_k),   (2)

and then encoded by the code c_{i*(x)} associated with p_{i*(x)}.

Using the small set of representative prototypes can reduce the computation and introduce sparsity without using the full dataset in the binary quantization step. Meanwhile, the prototypes can capture the discriminative essence of the dataset with sensitivity to the metric structure and robustness to overfitting [51]. In traditional hashing methods, usually a series of hash functions is pursued, each of which generates one hash bit, forming a long hash code. Therefore, these methods have to append additional constraints to reduce the redundancy among these individual bits, which usually degrades the performance with unreasonable assumptions. Quite different from them, our prototype based function can exploit the complex data structure and jointly generate a number of hash bits at the same time.

B. Space Alignment

Encoding the data into the binary codes indicates that the samples are grouped and constrained on the vertices of a hypercube with constant affinities between them. However, in practice it rarely happens that the data geometrically distribute in such a perfect structure. Therefore, an optimal binary coding strategy is highly required to jointly find the discriminative prototypes and their associated binary codes, which respectively characterize the inherent data relations and maintain the affinities between samples in Hamming space.

Intuitively, the prototype based hash function h should preserve the relations between any two samples x_i and x_j using their binary codes. A straightforward way is to concentrate on the consistency between their original distance and the Hamming distance. Subsequently, the binary coding in Hamming space should be aligned with the data distribution in the original space. Formally, we introduce the quantization loss to measure the space alignment:

Q(Y, X) = (1/n²) Σ_{i,j=1}^{n} ‖λ d_o(x_i, x_j) − d_h(y_i, y_j)‖²,   (3)

where d_h(y_i, y_j) = (1/2)‖y_i − y_j‖ is the square root of the Hamming distance between y_i = h(x_i) and y_j = h(x_j), and λ is a constant scale parameter for the space alignment.

The loss function involves n² sample pairs, which prevents efficient learning over a large training set. As we mentioned above, the representative prototypes have been proved helpful to substantially reduce the computation in many applications. Therefore, for x_i the distance from x_j can be approximated as follows:

d_o(x_i, x_j) ≈ d_o(x_i, p_{i*(x_j)}).   (4)

Motivated by the fact that the hash code of each sample x_i is actually equivalent to that of its nearest prototype, i.e., y_i = c_{i*(x_i)}, the loss function can be further rewritten in a simple and efficient way with respect to the prototypes P and their binary codes C:

Q(P, C, i*(X)) = Σ_{i=1}^{n} Σ_{k=1}^{|P|} (w_k / n²) ‖λ d_o(x_i, p_k) − d_h(c_{i*(x_i)}, c_k)‖²,

where w_k is the number of samples represented by p_k.

Note that the above approximation actually corresponds to the widely-used asymmetric distance in quantization research [52], where the database samples are substituted by their prototypes. The literature has shown that such asymmetric approximation usually owns great power to alleviate the quantization loss. Minimizing the above loss leads to a set of prototypes that well capture the intrinsic neighbor structure among the data, and thus a discriminative coding solution that consistently preserves the original relations in Hamming space. This means that minimizing the loss actually enforces close samples in the original space to be clustered in the same group represented by one prototype, and meanwhile their hash codes to maintain the distribution in Hamming space, which together align the neighbor structures between the two spaces.
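The prototype-form loss above can be evaluated directly; the following sketch is our illustration (the Euclidean distance as d_o and the square-root Hamming distance d_h of (3) are the assumed choices):

```python
import numpy as np

def d_h(c1, c2):
    # d_h = 0.5 * ||c1 - c2||, i.e. the square root of the Hamming distance
    return 0.5 * np.linalg.norm(c1 - c2)

def abq_loss(X, P, C, assign, lam):
    """Prototype-based quantization loss Q(P, C, i*(X)).

    X      : (n, d) samples
    P      : (K, d) prototypes
    C      : (K, b) binary codes in {-1, +1}
    assign : (n,) index i*(x_i) of the prototype of each sample
    lam    : scale parameter lambda
    """
    n, K = X.shape[0], P.shape[0]
    w = np.bincount(assign, minlength=K)          # w_k: samples per prototype
    loss = 0.0
    for i in range(n):
        for k in range(K):
            d_o = np.linalg.norm(X[i] - P[k])     # Euclidean d_o as an example
            loss += w[k] / n ** 2 * (lam * d_o - d_h(C[assign[i]], C[k])) ** 2
    return loss
```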

Therefore, we formulate the hashing problem in terms of the space alignment as follows:

min_{P, C, i*(X)} Q(P, C, i*(X))
s.t. c_k ∈ {−1, 1}^b; c_k^T c_l ≠ b, l ≠ k.   (5)

Here, the constraints on the binary codebook C will guarantee that each prototype is assigned a unique binary code.

It should be pointed out that here the number of prototypes, or the size of the codebook, is not fixed beforehand, which is quite different from prior hashing research like [43] and [49] where all possible binary codes (i.e., 2^b using b bits) are assumed to be used in the binary quantization. Indeed, we adaptively decide the number in the optimization according to the data metric structure. To some extent, this strategy will avoid the rigorous and difficult alignment between the prototypes and the hypercube binary codes, and thus faithfully helps discover more consistent and discriminative prototypes and the corresponding codebook.

By solving the problem, the prototype set P capturing the overall data distribution can be obtained. Each prototype will be associated with a distinct binary code from the codebook C, which together serve as a hash function that encodes those points belonging to the prototype using the corresponding binary code.

III. ALTERNATING OPTIMIZATION

There are several variables involved in the above problem, whose optimal values can hardly be obtained, partially due to the discrete constraint. To solve the problem with respect to a small b, we present an alternating optimization solution, which pursues near-optimal prototypes and adaptively determines the corresponding binary codes in an efficient way. For efficiency, we usually choose a small b (e.g., b ≤ 8), and later we will discuss how to obtain a much longer hash code.

A. Short Code Optimization

1) Adaptive Coding (Updating C): When we have the prototypes and the assignment index for each sample (see the initialization in Section III-C), the problem turns into a discriminative binary coding that consistently keeps the distribution of the samples in the original space. To solve a similar problem, the most related work, k-means hashing, presents an encouraging solution to capturing the cluster structure [43]. However, its discriminative power is still limited due to the requirement of a full hypercube structure, which is beyond the true data distribution in practice.

Quite different from the prior research, we adopt an adaptive coding that directly finds the binary codes most consistent with the prototypes. Specifically, we will sequentially find a locally optimal binary code for each prototype in a greedy way. Supposing the prototypes p_1, . . . , p_l (1 ≤ l ≤ |P|) have been respectively assigned the binary codes c_1, . . . , c_l, next we attempt to select the optimal code c_k for prototype p_k from the remaining hash code set C̄ = {−1, 1}^b − {c_1, . . . , c_l}. Then for c_k, the objective function in (5) turns to

min_{c_k ∈ C̄} Σ_{i*(x_i)=k} Σ_{k′≠k} w_{k′} ‖λ d_o(x_i, p_{k′}) − d_h(c_k, c_{k′})‖² + Σ_{i*(x_i)≠k} w_k ‖λ d_o(x_i, p_k) − d_h(c_{i*(x_i)}, c_k)‖².   (6)

Because the code space is quite limited (|C̄| ≤ 2^b), the above optimal code pursuit can be completed efficiently using exhaustive search over C̄.

In the above greedy solution, we can simply choose any binary code as the optimal c_1 for p_1 in the initial step. This is because the code space is highly symmetric with a hypercube structure. After repeating |P| steps, we can assign each prototype a unique hash code with the minimal loss, and meanwhile greedily keep the original data distribution. The prototypes are associated with distinct binary codes from the codebook C, which actually corresponds to a vertex subset of the hypercube of dimension b.
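A sketch of this greedy code selection is shown below (our illustration; it assumes K ≤ 2^b prototypes, a precomputed Euclidean distance matrix, and that only prototypes already holding a code contribute to the loss, as in the greedy pass):

```python
import numpy as np
from itertools import product

def greedy_code_assignment(D, assign, w, lam, b):
    """Greedily assign each prototype a distinct code from {-1,+1}^b, Eq. (6).

    D      : (n, K) matrix with D[i, k] = d_o(x_i, p_k)
    assign : (n,) prototype index i*(x_i) of each sample
    w      : (K,) number of samples per prototype
    lam    : scale parameter lambda
    """
    n, K = D.shape
    hypercube = [np.array(c) for c in product([-1.0, 1.0], repeat=b)]
    codes = [None] * K
    codes[0] = hypercube.pop(0)            # any code works for the first prototype
    for k in range(1, K):
        best, best_loss = None, np.inf
        for idx, cand in enumerate(hypercube):
            loss = 0.0
            for kp in range(k):            # only already-coded prototypes contribute
                d_hkk = 0.5 * np.linalg.norm(cand - codes[kp])
                # samples assigned to prototype k, compared against prototype kp
                loss += w[kp] * np.sum((lam * D[assign == k, kp] - d_hkk) ** 2)
                # samples assigned to prototype kp, compared against prototype k
                loss += w[k] * np.sum((lam * D[assign == kp, k] - d_hkk) ** 2)
            if loss < best_loss:
                best, best_loss = idx, loss
        codes[k] = hypercube.pop(best)
    return np.array(codes)
```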

2) Prototype Update (Updating P): After the binary codebook C is discovered, the prototypes P should be further calibrated to simultaneously capture the data distribution and align it to the geometric structure in the code space. Therefore, the problem with respect to P can be rewritten as:

min_P Σ_{i=1}^{n} Σ_{k=1}^{|C|} w_k ‖λ d_o(x_i, p_k) − d_h(c_{i*(x_i)}, c_k)‖².   (7)

Fig. 2. The convergence trend of adaptive binary quantization, in terms of the number of prototypes, the quantization loss and MAP with respect to the number of iterations, in one subspace (b = 8) on GIST-1M using 32 bits. (a) # prototypes. (b) Loss. (c) MAP.

The prototype discovery involves the assignment variable i*(x_i), and thus we adopt a two-step optimization: first determining which prototype the samples belong to, and then updating the position of each prototype based on the assignment.

Deriving from (7), the prototype that yields the least loss for each sample x_i can be found using a simple search:

min_{k′ ≤ |C|} Σ_{k=1}^{|C|} w_k ‖λ d_o(x_i, p_k) − d_h(c_{k′}, c_k)‖².   (8)

With the assignment of each sample, we approximately recalculate the position of each prototype:

p_k = (1/w_k) Σ_{i*(x_i)=k} x_i,  1 ≤ k ≤ |C|.   (9)

In this step the number of prototypes varies, i.e., P is shrunk, where the uninformative prototypes are eliminated. This is the key difference from previous research. Subsequently, the prototype set can gradually adapt the binary codes to the data distribution in the alternating optimization, and the binary codebook size in the next alternating round will also be updated.

3) Distribution Update (Updating i*(X)): After the prototype set P is updated, the data distribution with respect to P, characterized by the variable i*(X), will change slightly. Since the binary coding should maximally preserve the data distribution, we further append an assignment updating step to capture the distribution variation, which can be easily done by employing a step similar to k-means:

i*(x_i) = arg min_{k ≤ |P|} d_o(x_i, p_k).   (10)

This is consistent with the hash function definition in (2), guaranteeing that the hash function can discriminatively preserve the intrinsic data relations based on the prototypes.

Algorithm 1 lists the main steps of the proposed ABQ method. To illustrate its effectiveness, Figure 2 respectively shows the trends of the prototype number, the quantization loss and the precision with respect to the number of iterations. These subfigures together demonstrate how the performance benefits from the adaptive prototype set. Namely, ABQ can adaptively find the proper number of prototypes and the corresponding subset of the binary codes in the hypercube (Figure 2(a)), which maximally captures the neighbor relations among the data with significantly reduced quantization loss (Figure 2(b)) and improved performance of nearest neighbor search (Figure 2(c)).

Algorithm 1 Adaptive Binary Quantization (ABQ)
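As a concrete companion to Algorithm 1, the condensed, self-contained sketch below (ours; toy sizes, Euclidean d_o, a simple k-means-style seeding, and a fixed iteration budget are assumptions) runs the full alternating loop of adaptive coding, prototype update with pruning, and re-assignment:

```python
import numpy as np
from itertools import product

def abq_train(X, b=3, iters=10, lam=1.0, seed=0):
    """Condensed sketch of the ABQ alternating optimization (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    K = 2 ** b
    # k-means-style initialization of the prototype set P (Section III-C)
    P = X[rng.choice(n, K, replace=False)].copy()
    for _ in range(5):
        a = np.argmin(np.linalg.norm(X[:, None] - P[None], axis=2), axis=1)
        P = np.stack([X[a == k].mean(0) if np.any(a == k) else P[k] for k in range(K)])
    assign = np.argmin(np.linalg.norm(X[:, None] - P[None], axis=2), axis=1)
    cube = np.array(list(product([-1.0, 1.0], repeat=b)))     # all 2^b candidate codes

    for _ in range(iters):
        D = np.linalg.norm(X[:, None] - P[None], axis=2)       # d_o(x_i, p_k)
        w = np.bincount(assign, minlength=len(P))
        # adaptive coding: greedy code selection per prototype (Eq. 6)
        free = list(range(len(cube)))
        C = np.zeros((len(P), b))
        C[0] = cube[free.pop(0)]
        for k in range(1, len(P)):
            best, best_loss = None, np.inf
            for j in free:
                dh = 0.5 * np.linalg.norm(cube[j] - C[:k], axis=1)
                loss = sum(w[kp] * np.sum((lam * D[assign == k, kp] - dh[kp]) ** 2)
                           + w[k] * np.sum((lam * D[assign == kp, k] - dh[kp]) ** 2)
                           for kp in range(k))
                if loss < best_loss:
                    best, best_loss = j, loss
            C[k] = cube[free.pop(free.index(best))]
        # prototype update (Eqs. 8-9): reassign to codes, drop empty prototypes
        Dh = 0.5 * np.linalg.norm(C[:, None] - C[None], axis=2)
        assign = np.argmin((((lam * D)[:, None, :] - Dh[None]) ** 2
                            * w[None, None, :]).sum(2), axis=1)
        keep = [k for k in range(len(P)) if np.any(assign == k)]
        P = np.stack([X[assign == k].mean(0) for k in keep])
        C = C[keep]
        # distribution update (Eq. 10): nearest surviving prototype
        assign = np.argmin(np.linalg.norm(X[:, None] - P[None], axis=2), axis=1)
    return P, C, assign

if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(500, 8))
    P, C, assign = abq_train(X)
    print(len(P), "prototypes kept out of", 2 ** 3)
```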

B. Long Code via Product Quantization

In practice, usually a long hash code is required for a desired level of performance in many applications. However, till now the proposed algorithm can only generate small codes with b ≤ 8, because the prototype number usually ranges from tens to hundreds at most for the representation power and the computational efficiency. Fortunately, our algorithm can be naturally generalized to the product space for longer hash codes using product quantization (PQ) [43], [52]. In order to generate a sufficiently long code of length b* (b* > b), the PQ technique divides the original space into M = b*/b subspaces, in which a small code of length b = b*/M is respectively associated with each sample and concatenated into a long one in a Cartesian product manner.

Specifically, a vector x is represented as M sub-vectors in the way x = [x^{(1)}, x^{(2)}, . . . , x^{(M)}]^T, where x^{(m)} ∈ R^{d×1} is the m-th sub-vector of x, and its hash code y^{(m)} ∈ {−1, 1}^{b×1} can be generated using the proposed adaptive quantization based on the sub-prototypes p^{(m)} ∈ R^{d×1} and the sub-codebook c^{(m)} ∈ {−1, 1}^{b×1}. The hash code y for vector x is the concatenation of the sub-codes of its sub-vectors: y = [y^{(1)}, y^{(2)}, . . . , y^{(M)}].
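A sketch of this product-space encoding follows (our illustration; the per-subspace prototypes and codebooks are random stand-ins for models learned as described above):

```python
import numpy as np

def encode_product(x, sub_prototypes, sub_codes):
    """Concatenate per-subspace ABQ codes: y = [y^(1), ..., y^(M)].

    sub_prototypes : list of M arrays, each (K_m, d_m)
    sub_codes      : list of M arrays, each (K_m, b) with entries in {-1, +1}
    """
    splits = np.cumsum([p.shape[1] for p in sub_prototypes])[:-1]
    sub_vectors = np.split(x, splits)                 # x = [x^(1), ..., x^(M)]
    parts = []
    for x_m, P_m, C_m in zip(sub_vectors, sub_prototypes, sub_codes):
        k_star = np.argmin(np.linalg.norm(P_m - x_m, axis=1))   # nearest sub-prototype
        parts.append(C_m[k_star])                                # its b-bit sub-code
    return np.concatenate(parts)                       # b* = M * b bits in total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, M, b, K = 16, 4, 4, 10                          # toy sizes
    sub_prototypes = [rng.normal(size=(K, d // M)) for _ in range(M)]
    sub_codes = [np.sign(rng.normal(size=(K, b))) for _ in range(M)]
    y = encode_product(rng.normal(size=d), sub_prototypes, sub_codes)
    print(y.shape)      # (16,) -> a 16-bit code from 4 subspaces of 4 bits
```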

In each subspace, the learnt codes can approximate the original distance d_o well using the Hamming based distance d_h. If the original distance d_o is defined as the Euclidean distance, using the product quantization it is also easy to see that the Hamming distances between the concatenated codes can still approximate the original ones [43].

Recall that Equation (4) corresponds to the asymmetric distance computation (ADC) in PQ. If the original distance d_o is defined as the Euclidean distance, PQ can approximate the distance between two vectors using the codewords (prototypes):

d_o(x_i, x_j) ≈ d_o(x_i, p_{i*(x_j)}) = sqrt( Σ_{m=1}^{M} d_o(x_i^{(m)}, p^{(m)}_{i*(x_j^{(m)})})² ).   (11)

In each subspace, the learnt codes can approximate the original distance d_o well using the Hamming based distance d_h, i.e., λ d_o(x_i^{(m)}, p_k^{(m)}) ≈ d_h(y_i^{(m)}, c_k^{(m)}). Then, with the definition of the distance d_h, we have:

λ d_o(x_i, p_k) ≈ sqrt( Σ_{m=1}^{M} (1/4) ‖y_i^{(m)} − c_k^{(m)}‖² ) = (1/2) ‖y_i − c_k‖ = d_h(c_{i*(x_i)}, c_k).   (12)

Putting (11) and (12) together, it is easy to show that the original distance between any two samples can be approximated by the Hamming based distance between their hash codes in the Cartesian space:

λ d_o(x_i, x_j) ≈ d_h(c_{i*(x_i)}, c_{i*(x_j)}).   (13)

Note that the above approximation requires that the scale parameter λ remains the same across all subspaces, which holds roughly in practice over many datasets. Therefore, we set it to the average of the values computed according to Equation (14) in all subspaces.
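As a small sanity check of (12), the Hamming-based distance between concatenated codes decomposes exactly over the subspaces; the snippet below (our illustration) verifies the identity d_h(y_i, y_j)² = Σ_m d_h(y_i^{(m)}, y_j^{(m)})² on random codes:

```python
import numpy as np

def d_h(a, b):
    # square root of the Hamming distance for codes in {-1, +1}
    return 0.5 * np.linalg.norm(a - b)

rng = np.random.default_rng(0)
M, b = 8, 8
yi = np.sign(rng.normal(size=M * b))        # two random concatenated codes
yj = np.sign(rng.normal(size=M * b))
per_sub = [d_h(yi[m * b:(m + 1) * b], yj[m * b:(m + 1) * b]) ** 2 for m in range(M)]
# the concatenated distance equals the root of the per-subspace sum, as in Eq. (12)
assert np.isclose(d_h(yi, yj), np.sqrt(sum(per_sub)))
print(d_h(yi, yj))
```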

Prior research has pointed out that equally splitting the space into M parts might result in ineffective hash codes, due to the unbalanced information distribution [43]. Usually independent subspaces are pursued to balance the information among the small codes of each sample. Therefore, in the space decomposition, we apply the eigenvalue allocation method to evenly distribute the variance using a PCA projection without dimension reduction [53], as sketched below. One can also further append an adaptive bit allocation to maximally capture the data information using different numbers (or code lengths) of hash bits [29].
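The sketch below is our simplified take on such a balanced decomposition in the spirit of eigenvalue allocation [53]; the greedy rule (assign each principal direction to the currently "lightest" subspace) and the equal group sizes are assumptions, not the exact procedure of [53]:

```python
import numpy as np

def eigenvalue_allocation(X, M):
    """Simplified balanced subspace decomposition (in the spirit of [53]).

    Returns the PCA rotation (d x d, no dimension reduction) and M groups of
    principal-direction indices with roughly balanced variance products.
    Assumes M divides the dimension d.
    """
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]                  # descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cap = X.shape[1] // M                              # dimensions per subspace
    groups = [[] for _ in range(M)]
    log_prod = np.zeros(M)                             # log of each group's variance product
    for idx, ev in enumerate(eigvals):
        open_groups = [g for g in range(M) if len(groups[g]) < cap]
        g = min(open_groups, key=lambda j: log_prod[j])  # fill the "lightest" group
        groups[g].append(idx)
        log_prod[g] += np.log(max(ev, 1e-12))
    return eigvecs, groups
```

Projecting the centered data with the returned rotation and slicing the columns by groups[m] would then give the m-th subspace on which a small ABQ code is learned.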

C. Initialization

In the alternating optimization, we should first initialize the assignment indices i*(X) and the prototype set P. In practice, we fix the size of P to 2^b at first, and then generate the prototypes P using the cluster centers of the classical k-means algorithm on the training data X, where each sample is assigned to its nearest prototype. Although the quality of the prototypes depends on the seed selection in the k-means initialization phase, we found that it does not affect the overall performance much. This is mainly because the positions and the quantity of the prototypes will be refined gradually in the iterative optimization to align the data distribution to the code space, and even with a coarse initialization, one can still obtain informative prototypes in a number of iterations. Besides, since it has minor effects on the performance according to our empirical observation, we randomly select the order in which the prototypes are processed when updating the codebook.

The scale parameter λ in Equation (3) is intuitively adopted to make the distances comparable between the original and Hamming spaces. Since it is usually insensitive to the binary coding process, we simply set it to a constant based on the k-means initialization, assuming that all 2^b prototypes are assigned different binary codes:

λ = [ (1/2^b) Σ_{c_k, c_l ∈ {−1,1}^b} d_h(c_k, c_l) ] / [ (1/n) Σ_{i=1}^{n} Σ_{k=1}^{2^b} d_o(x_i, p_k) ].   (14)
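A direct transcription of (14) as a sketch (ours; the Euclidean distance as d_o and placeholder sample/prototype arrays are assumed):

```python
import numpy as np
from itertools import product

def init_lambda(X, P, b):
    """Scale parameter lambda of Eq. (14), computed after k-means initialization."""
    cube = np.array(list(product([-1.0, 1.0], repeat=b)))      # all 2^b codes
    # numerator: (1/2^b) * sum over all code pairs of d_h(c_k, c_l)
    dh = 0.5 * np.linalg.norm(cube[:, None] - cube[None], axis=2)
    num = dh.sum() / 2 ** b
    # denominator: (1/n) * sum over samples and prototypes of d_o(x_i, p_k)
    do = np.linalg.norm(X[:, None] - P[None], axis=2)           # Euclidean d_o
    den = do.sum() / X.shape[0]
    return num / den
```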


TABLE I

TIME CONSUMPTION OF THE PROPOSED ABQ

D. Computation Complexity

At the training stage, to learn the hash functions that can generate binary codes of length b*, we need to compute the small codes in M = b*/b independent subspaces. For each subspace, there are maximally 2^b prototypes in a d/M-dimensional feature space. The adaptive coding greedily finds the locally optimal code for each prototype over {−1, 1}^b in O(nd2^{2b}) time. The prototype and distribution update steps require at most O(nd2^{2b}) time to compute the distances between training samples and prototypes. Therefore, when using t (usually t ≤ 20) iterations in the alternating optimization, totally O(2^{2b}ndt) time is spent on the training. Since the code space is quite limited for each subspace (b ≤ 8), the term 2^{2b} can be treated as a constant. Therefore, it can be considered that the training time scales linearly with respect to the size of the training set. Table I lists the detailed time complexity of the training stage.

When it comes to the online search, for each query point the hash function needs O(2^b d) time to find the nearest prototype and O(1) time for the code assignment, which is linear in the feature dimension d, as in most projection based hashing methods like LSH [14] and ITQ [35]. Furthermore, our method only utilizes a small subset (e.g., a quarter) of the codes, which directly reduces the time consumption at the stage of hash code generation. Therefore, compared with other prototype based methods like KMH, our method is usually faster when performing online search in practice (see Table II).

IV. DISTRIBUTED LEARNING FRAMEWORK

For data-dependent hashing algorithms, with an increasing number of training samples, more information about the nearest neighbor structure can be obtained to improve the retrieval performance. Figure 3 illustrates this effect when using ABQ on the Tiny-80M dataset. We respectively plot the recall curves of ABQ using 10K and 1M training samples. It is obvious that the recall performance is significantly improved when using a large-scale training set. The result indicates that an efficient learning algorithm is highly required for the large-scale problem.

Fig. 3. Recall performance when learning 32 hash bits over different numbers (10K and 1M) of training data from TINY-80M.

Fig. 4. An illustration of the distributed and parallel optimization framework, where each node processes the training subset locally and each thread in each node processes the divided subspace using PQ.

Nowadays, many practical systems naturally store and process massive data across a number of computing nodes in a distributed manner. Subsequently, distributed learning has become a common and promising solution to large-scale learning problems, which can perform the training over the distributed data in an efficient way based on distributed parallel computation. However, most existing hashing algorithms employ the centralized manner, and can hardly learn hash functions or codes directly from a large training dataset due to the expensive computational and memory cost. In this part, we will propose a distributed learning framework for our ABQ. Note that the proposed distributed ABQ (D-ABQ) can achieve fast computation by fully utilizing distributed learning over multiple training subsets and simultaneously incorporating the multi-thread technique into the PQ in each subspace individually. In Figure 4 we illustrate this flexible and efficient learning framework.

A. Distributed Optimization

The proposed ABQ hashing method can be naturally adapted to a distributed learning framework, speeding up the training significantly while consuming very little communication cost.

TABLE II

HASHING PERFORMANCE AND TIME EFFICIENCY ON SIFT-1M AND GIST-1M

Supposing the whole training data X is distributed across V computing nodes: X = [X^{1}, X^{2}, . . . , X^{V}], where X^{v} is the data stored at the v-th node. The proposed distributed ABQ method attempts to learn the hash function by minimizing the same objective function as ABQ does, but using V nodes. As the previous section states, the minimization involves the update of the prototype set P and the corresponding codebook C in an alternating manner; the distributed ABQ shares the same optimizing procedures, and thus the time complexity of the three-step updating in the v-th node is also linear in the size of the training set X^{v}. However, different from the basic ABQ, the distributed learning process needs to synchronize P and C across the nodes in the distributed network, and thus brings a little additional cost for network communication. Next, we will introduce the main steps of the distributed learning in detail.

1) Update C: This step corresponds to the Adaptive Coding step in ABQ. For each prototype p_k, we have to compute the loss with respect to the candidate binary code c_k according to Equation (6). Since the computation involves the full set of the training data, it cannot be completed locally on each node. Fortunately, the problem can be naturally decomposed into the following sub-problems:

min_{c_k ∈ C̄} L(c_k) = Σ_{v=1}^{V} L^{v}(c_k),   (15)

where L^{v}(c_k) is the local loss value computed by the v-th node:

L^{v}(c_k) = Σ_{i*(x_i)=k, x_i ∈ X^{v}} Σ_{k′≠k} w_{k′} ‖λ d_o(x_i, p_{k′}) − d_h(c_k, c_{k′})‖² + Σ_{i*(x_i)≠k, x_i ∈ X^{v}} w_k ‖λ d_o(x_i, p_k) − d_h(c_{i*(x_i)}, c_k)‖².   (16)

Based on the decomposition, we can complete this step in a map-reduce computing way. Namely, we can first compute the intermediate value L^{v} in each node respectively using its training data, and then gather them to generate the final value. Once the optimal c_k is found, it will be broadcast to all nodes. In this way, the codebook C can be maintained on each node and updated consistently across all nodes.
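A sketch of this decomposition, simulating the V nodes in-process with plain Python lists instead of an actual cluster (the node list format and the assumption that the other prototypes keep their current codes are ours):

```python
import numpy as np

def d_h(a, b):
    return 0.5 * np.linalg.norm(a - b)

def local_loss(X_v, assign_v, P, C, k, cand, w, lam):
    """L^{v}(c_k) of Eq. (16): one node's loss for candidate code `cand` of prototype k."""
    loss = 0.0
    for x, a in zip(X_v, assign_v):
        if a == k:
            for kp in range(len(P)):
                if kp != k:
                    loss += w[kp] * (lam * np.linalg.norm(x - P[kp]) - d_h(cand, C[kp])) ** 2
        else:
            loss += w[k] * (lam * np.linalg.norm(x - P[k]) - d_h(C[a], cand)) ** 2
    return loss

def distributed_code_update(nodes, P, C, k, candidates, w, lam):
    """Eq. (15): map local losses on each node, reduce by summation, pick the best code.

    nodes : list of (X_v, assign_v) pairs, one per simulated node
    """
    best, best_loss = None, np.inf
    for cand in candidates:
        total = sum(local_loss(X_v, a_v, P, C, k, cand, w, lam)   # "map" per node
                    for X_v, a_v in nodes)                        # "reduce": sum
        if total < best_loss:
            best, best_loss = cand, total
    return best   # in a real deployment this winner would be broadcast to all nodes
```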

2) Update P: This updating corresponds to the Prototype Update step in ABQ. Similar to the C sub-problem, to recalculate the position of each prototype, we can decompose the updating operation in (9) according to the following equation:

p_k = Σ_{v=1}^{V} p_k^{v},   (17)

where p_k^{v} can be locally evaluated in the v-th node as in the centralized way:

p_k^{v} = Σ_{i*(x_i)=k, x_i ∈ X^{v}} x_i.   (18)

The above computation also follows the standard map-reduce model, and can be easily completed, since each p_k^{v} can be computed independently.

After each p_k is updated, like the procedure of updating C, the whole prototype set P can be maintained and updated by only broadcasting the result p_k over the network.
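A matching sketch of the prototype aggregation in (17)-(18), again with simulated nodes; the final division by w_k, which restores the mean of (9), is made explicit here:

```python
import numpy as np

def distributed_prototype_update(nodes, K, d):
    """Aggregate the local partial sums p_k^{v} (Eq. 18) into global prototypes (Eq. 17).

    nodes : list of (X_v, assign_v) pairs, one per simulated node
    """
    sums = np.zeros((K, d))
    counts = np.zeros(K)
    for X_v, assign_v in nodes:                 # "map": per-node partial sums and counts
        for k in range(K):
            mask = assign_v == k
            sums[k] += X_v[mask].sum(axis=0)
            counts[k] += mask.sum()
    # "reduce": combine across nodes and normalize by w_k as in Eq. (9);
    # prototypes that attract no samples are pruned, shrinking the codebook
    keep = counts > 0
    return sums[keep] / counts[keep, None], keep
```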

3) Update i*(X): This part will reassign each training sample to its nearest prototype. Since in the last step all the nodes have obtained the whole prototype set, each i*(x_i) can be updated locally and independently in the node storing it, without any information exchanged among the nodes. After the re-assignment, the weights w = [w_1, w_2, . . . , w_{|P|}]^T for the prototypes should be shared across the nodes for the next iteration. This can be completed by adopting similar steps as for updating P, and the cost is negligible compared to that of P and C.

As to the initialization, there exist many distributed k-means algorithms that can be used to initialize the prototype set P efficiently.

B. Computation and Communication Complexity

In general, in each iteration the data processing can be uniformly described as a standard map-reduce computing process: (1) the mappers independently compute the intermediate information I^{1}, I^{2}, . . . , I^{V} (including L^{v}, p_k^{v} and w_k) on each node in parallel, (2) the reducers aggregate the intermediate values to a single node and compute the final result R, and (3) finally the result R is broadcast to all nodes for the next iteration.

Fig. 5. The distributed learning framework with a uniform computing and communicating model, where the intermediate information I^{v} is computed by each node and the result R is broadcast across the network.

The communication process is demonstrated in Figure 5. In our distributed ABQ, the mapper task is computation-intensive and meanwhile data-intensive, since each mapper only has to digest the data locally distributed in its node. This makes the implementation quite simple and the computation very fast in practice. Moreover, according to the above analysis, there are totally three parts in the alternating optimization that involve communication across the nodes, mainly included in the steps of updating C and updating P. For simplicity, here we assume that all the nodes have an equal number of training data, i.e., X^{v} ∈ R^{d×⌈n/V⌉} for any v-th node.

When updating C, all the local losses L^{v} are first gathered in the reducer node, at a communication cost of O(V 2^b), and then all the optimal c_k are broadcast to the other nodes in O(V b* 2^b). When it comes to the updating of P, the reducer node collects the local prototypes p_k^{v} in O(V d 2^b), and then, similar to the procedure of updating C, the whole prototype set P is broadcast to all nodes, also in O(V d 2^b).

Table I lists the primary computation and communication complexity of distributed ABQ in detail, compared to the time complexity of the original ABQ. From the table we can draw the following conclusions about the distributed ABQ: (1) the computation load is evenly distributed over the nodes, which guarantees the efficiency of the distributed computing; (2) the communication complexity is much smaller than the computation complexity, and it is independent of the size of the training set; and, more importantly, (3) the total time that the distributed ABQ spends on n training samples is much less than that of ABQ. Since the distributed learning conducts the same training process as ABQ does, we can see that the distributed ABQ algorithm can significantly boost the efficiency of the proposed quantization solution in distributed settings, without any loss of performance. Note that using the multi-thread based parallel technique, the proposed method can be further sped up, owing to the independent PQ decomposition.

In the literature, DistH was first proposed as one of the very few distributed hashing methods [54]. Both methods have linear computational complexity with respect to the data size, and their communication costs are both small and independent of the training size. However, DistH attempts to learn ITQ hash functions in the distributed environment using an approximate solution, which cannot guarantee the same performance as (and usually performs worse than) the original ITQ. On the contrary, our distributed ABQ is naturally extended from the original ABQ method with a lossless sub-problem decomposition, and thus undoubtedly performs better than both DistH and ITQ.

V. EXPERIMENTS

This section will evaluate the proposed adaptive binary quantization (ABQ) in the task of large-scale nearest neighbor search. ABQ is compared with a number of state-of-the-art hashing algorithms covering the two main types of hashing research: the projection based ones, including Locality Sensitive Hashing (LSH) [14], Spectral Hashing (SH) [19], Kernelized Locality Sensitive Hashing (KLSH) [46], Anchor Graph Hashing (AGH) [20], Iterative Quantization (ITQ) [35] and Kronecker Binary Embedding (KBE) [55], and the prototype based ones, including Spherical Hashing (SPH) [49] and K-Means Hashing (KMH) [43], as the most representative methods.

• LSH randomly generates Gaussian projection vectors for the cosine similarity.

• SH formulates the binary coding problem as a spectral embedding in the Hamming space, and generalizes the approximated solution for out-of-sample extension.

• KLSH constructs randomized locality-sensitive functions with arbitrary kernel functions. We feed it the Gaussian RBF kernel and 300 support samples.

• AGH approximates the intrinsic structure underlying the data based on anchors, and generates hash codes based on the anchor representation.

• ITQ iteratively finds the data rotation in a subspace to minimize the binary quantization error.

• KBE generates linear hash functions with a structured matrix, which can achieve fast hash coding over high-dimensional data. We adopt the optimized version of the Kronecker projection.

• SPH iteratively adjusts the spherical planes to generate independent and balanced partitions, which serve as nonlinear hash functions based on the distances to the centers. In SPH, each partition can generate a hash bit independently.

• KMH generates affinity-preserving clusters using k-means in the partitioned subspaces of the training features, and maps each cluster to a binary hash code for the out-of-sample coding.

Note that since the distributed ABQ (D-ABQ) conducts exactly the same optimizing steps as ABQ, their performance will be identical over the same training dataset with the same setting. In the following experiments, the nearest neighbor search performance is reported for both ABQ and D-ABQ, unless otherwise specified.

A. Evaluation Protocols

To comprehensively evaluate the proposed method, we first employ two well-known large-scale datasets, SIFT-1M (1M) and GIST-1M (1M) [52]. The two datasets respectively contain one million 128-D SIFT and 960-D GIST descriptors, each of which comes with a separate query subset. We respectively construct a training set of 10,000 random samples and a testing set of 1,000 random queries on both datasets. Besides, we employ another two much larger datasets, SIFT-20M (20M) [52] and Tiny-80M (80M) [56], respectively consisting of 20 million 128-D SIFT and 80 million 384-D GIST features. We randomly sample 50,000 and 100,000 points as the training sets, and 3,000 queries as the testing ones. As to the groundtruth of each query, we select the 1,000 Euclidean nearest neighbors among the database on SIFT-1M, GIST-1M and SIFT-20M, and 5,000 on Tiny-80M.

Fig. 6. Recall performance of different hashing methods on SIFT-1M and GIST-1M. (a) 64 bits on SIFT-1M. (b) 128 bits on SIFT-1M. (c) 64 bits on GIST-1M. (d) 128 bits on GIST-1M.

We adopt two common search schemes to evaluate the hashing performance, i.e., Hamming distance ranking and hash table lookup. The former ranks all candidates based on their Hamming distances from the query, and the latter treats points falling within a small Hamming radius r (r ≤ 2) from the query code as the retrieved results. As to KMH and our ABQ with product quantization, we set b = 4 for SIFT features when using less than 64 bits, and b = 8 for all other cases. In each experiment, we run 10 times on a workstation with a 2.53 GHz Xeon CPU and report the averaged performance.
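For reference, a sketch of the two evaluation schemes on ±1 codes (our illustration; production benchmarks typically pack bits and use popcount instead):

```python
import numpy as np

def hamming_distances(query_code, db_codes):
    # codes are in {-1, +1}; Hamming distance = number of disagreeing bits
    return np.sum(db_codes != query_code, axis=1)

def hamming_ranking(query_code, db_codes):
    """Rank all database items by Hamming distance from the query."""
    return np.argsort(hamming_distances(query_code, db_codes), kind="stable")

def table_lookup(query_code, db_codes, r=2):
    """Return indices of items within Hamming radius r of the query code."""
    return np.where(hamming_distances(query_code, db_codes) <= r)[0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    db = np.sign(rng.normal(size=(1000, 32)))
    q = np.sign(rng.normal(size=32))
    print(hamming_ranking(q, db)[:5], len(table_lookup(q, db, r=2)))
```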

B. Results and Discussions

1) Euclidean Nearest Neighbor Search: We first evaluate all hashing methods in the task of Euclidean nearest neighbor search over SIFT-1M and GIST-1M. Table II lists the mean average precision (MAP) using Hamming distance ranking with respect to different numbers of hash bits. From the table we can observe that all methods increase their MAP performance when using more hash bits, from 32 to 128 bits. Moreover, methods like ITQ, SPH, KMH and our (D-)ABQ, which encode the data from the view of clustering quantization, consistently achieve much better performance than methods like LSH, SH and AGH. This indicates that discovering a particular quantization strategy for binary hashing is a promising direction. Among all these methods, (D-)ABQ obtains the best performance, and gets significant performance gains over the best competitors, e.g., using 128 bits, 24.41% over ITQ on SIFT-1M, and 39.76% over SPH on GIST-1M.

Figure 6 further plots the recall curves with respect to different numbers of retrieved results on both datasets, where we can draw the same conclusion that (D-)ABQ performs best in all cases. The main reason is that, compared to the baselines where the codebook is fixed, (D-)ABQ can adaptively generate a codebook of varying size and well match the binary codes to the prototypes. For the performance of Hamming distance ranking, we compare our (D-)ABQ with the baseline methods in terms of precision, besides the recall and MAP performance. Figure 6 also plots the precision curves with respect to different cut-off points of the retrieved result lists on SIFT-1M and GIST-1M, where we vary the number of hash bits from 64 to 128. We can see that our (D-)ABQ consistently performs best, with significant superiority over the others.

Besides Hamming distance ranking, hash table lookup is another common search strategy over the hash codes. In this case, usually a small code (e.g., 32 bits for one million data points) is used to avoid the memory and time consumption derived from the exponentially huge number of indexing buckets. Table II further reports the precision within a small Hamming radius r = 1 and r = 2 (PH1 and PH2 for short). This is also a popular evaluation metric in practice, because with a small lookup radius, nearest neighbor search can be efficiently completed by only locating data falling in buckets whose Hamming distance from the query is less than the radius. Similarly, it is easy to see that (D-)ABQ outperforms the baselines by a large margin, e.g., 15.91% and 58.84% PH1 gains over KMH on SIFT-1M and GIST-1M, respectively. Compared to SPH and KMH, which also exploit prototype based hash functions, the encouraging precision gains obtained by (D-)ABQ indicate that our method can approximate the neighbor relations much better by encoding the data using a subset of binary codes in Hamming space. This intuition is also visually demonstrated in Figure 1 using a subset of SIFT-1M.

2) Nearest Neighbor Search Over Large Datasets: To investigate the performance of different hashing methods over larger-scale datasets, we adopt two of the largest datasets to date: SIFT-20M and Tiny-80M. Table III reports the precision performance in terms of Hamming distance ranking and hash table lookup. Here, due to the facts that in practice users are more concerned about the top ranked results, and that computing the MAP of the full Hamming distance ranking list is quite time-consuming [20], [24], we present the average precision of the top 1,000 returned samples (P@1,000) instead of MAP with respect to the varying code length (32, 64 and 128). Similar to the results in Table II, in all cases our (D-)ABQ consistently obtains the best precision, especially on the Tiny-80M dataset with remarkable superiority, e.g., up to a 45.87% performance gain over the best competitor SPH. As to hash table lookup, Table III also lists the PH1 performance using a 32-bit hash table, from which we can get a similar observation that (D-)ABQ shows a better capability of capturing the neighbor structures, and thus covers many more nearest neighbors than all baselines.


TABLE III

HASHING PERFORMANCE ON SIFT-20M AND TINY-80M

Fig. 7. Recall performance of different hashing methods on SIFT-20M and Tiny-80M. (a) 64 bits on SIFT-20M. (b) 128 bits on SIFT-20M. (c) 64 bits on Tiny-80M. (d) 128 bits on Tiny-80M.

Figure 7 respectively depicts the recall curves using 64 and 128 bits on SIFT-20M and Tiny-80M. Compared to all the baselines, our ABQ boosts the recall with the most significant improvement when using more hash bits, and consistently performs best in all cases. On the real-world dataset Tiny-80M, this observation is more obvious, as shown in Figure 7(c) and (d), where the recall of the top 10^4 result list increases largely from 14.61% to 22.93% with more hash bits, while the best performance among all baselines is 17.90%, achieved by SPH using 128 hash bits. This fact further demonstrates that our (D-)ABQ can faithfully boost the overall hashing performance in terms of precision and recall, using Hamming distance ranking or hash table lookup. As to the precision performance, since the groundtruth number is fixed to a constant, a similar observation to that for recall can be made, which means that the proposed method obtains the best recall and meanwhile the best precision performance on the two much larger datasets SIFT-20M and Tiny-80M.

C. Efficiency Issue

Figure 2 shows that the proposed ABQ can converge fast, in fewer than 10 iterations. Therefore, in practice, the algorithm can achieve efficient training and support large-scale learning. This is consistent with our complexity analysis in Section III-D, e.g., the training time scales linearly with the size of the training set.

Table II further lists the offline training time and online search time when using 128 hash bits on SIFT-1M and GIST-1M on a single node. As expected, the iterative binary quantization methods such as ITQ, SPH, KMH and our ABQ take more training time than the others, mainly due to the difficulty of finding an optimal coding solution that aligns the Hamming space with the original one. Among these methods, our ABQ costs much less time than KMH, and gives the best performance with only slightly more training time than SPH and ITQ. Moreover, at the online search stage only a small set of prototypes (fewer than 2^b, with b ≤ 8 in each subspace) needs to be checked, so the hashing time is very close to that of the prior projection-based methods. Namely, it supports real-time nearest neighbor search as the existing methods do.
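To make the online cost concrete, the sketch below encodes a query under a prototype-based product-space scheme of the kind described here: in each subspace the query is compared against at most 2^b learned prototypes, and the b-bit code pre-assigned to the nearest prototype is emitted. The prototype sets and code assignments are assumed to come from training; the variable names are illustrative rather than the authors' implementation.

    import numpy as np

    def encode_query(x, sub_prototypes, sub_codes):
        """Concatenate the binary codes of the nearest prototype in each subspace.

        sub_prototypes[m]: (k_m, d_m) prototypes of subspace m, with k_m <= 2**b
        sub_codes[m]:      (k_m, b)   0/1 code assigned to each prototype
        """
        bits, start = [], 0
        for protos, codes in zip(sub_prototypes, sub_codes):
            d_m = protos.shape[1]
            x_m = x[start:start + d_m]          # the query restricted to subspace m
            start += d_m
            nearest = np.argmin(((protos - x_m) ** 2).sum(axis=1))
            bits.append(codes[nearest])
        return np.concatenate(bits)             # full hash code of the query

Since each subspace involves at most 2^b (b ≤ 8) distance computations, the encoding cost stays comparable to that of a projection-based hash of the same length.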

D. Distributed Speedup

To evaluate the scalability of D-ABQ, we conduct the experiments on a computer cluster consisting of up to 8 nodes, each with a 2.53 GHz Xeon CPU. The training data is evenly distributed across them. In all experiments we learn 32-bit hash codes on the Tiny-80M dataset using D-ABQ.

We first adopt “Speedup” to comprehensively study the performance of D-ABQ. “Speedup” is a commonly used performance metric to measure the efficiency improvement when using more nodes for the same task. Here we define “Speedup” with respect to the node number V as follows:

Speedup(V) = (training time on one node) / (training time on V nodes),    (19)

while keeping the size n of the training dataset fixed.

Fig. 8. The training time with respect to different numbers of nodes and different sizes of the training dataset. (a) Speedup. (b) Sizeup.

Figure 8(a) depicts the speedup curves with respect to different numbers of nodes, using 10K and 1M random samples for training, respectively. The Speedup increases monotonically (almost linearly) as the number of nodes grows, which indicates that more nodes balance the computation and thus reduce the total training time. Besides, although the two curves share the same increasing trend with respect to the node number, the larger training set (1M) achieves a higher growth rate, mainly because the communication cost becomes negligible relative to the computation time on a much larger training dataset (see the complexity analysis in Section IV-B). This further confirms that the proposed ABQ equipped with distributed learning performs well on large-scale training datasets.

To further explore the relationship between the time consumption and the number of training samples, we also evaluate the “Sizeup” performance with respect to the size of the training set. “Sizeup” measures how much longer training takes when the size n of the training dataset becomes m times larger:

Sizeup(m) = (training time over m × n samples) / (training time over n samples),    (20)

while fixing the number of nodes.

Figure 8(b) shows the “Sizeup” curves for different numbers of nodes, with the initial size n set to 10K. In all cases the training time scales linearly with the number of training samples, which is consistent with the time complexity analysis in Section III-D. Moreover, when training over a dataset of the same size, involving more nodes in the distributed learning yields a slower growth rate of the “Sizeup”. This observation is similar to that in Figure 8(a), i.e., using more nodes benefits the algorithm efficiency more significantly and is more suitable for large-scale learning.
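As a worked example of Eqs. (19) and (20) with made-up timings (purely illustrative, not measurements from the paper): training that takes 960 s on one node and 135 s on eight nodes gives Speedup(8) = 960/135 ≈ 7.1, and growing the training set 100× from a 12 s run to a 1,180 s run gives Sizeup(100) ≈ 98, i.e., nearly linear. The two ratios are trivial to compute:

    def speedup(t_one_node, t_v_nodes):
        """Eq. (19): single-node training time over V-node training time (data size fixed)."""
        return t_one_node / t_v_nodes

    def sizeup(t_n_samples, t_mn_samples):
        """Eq. (20): training time over m*n samples divided by that over n samples (nodes fixed)."""
        return t_mn_samples / t_n_samples

    # Illustrative numbers only:
    print(speedup(960.0, 135.0))   # ~7.1 with 8 nodes
    print(sizeup(12.0, 1180.0))    # ~98 when the training set grows 100x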

E. Effect of Groundtruth Number

Prior research has pointed out that the number of groundtruth neighbors may affect the measured performance [43]. Therefore, to illustrate the robustness of our method with respect to the groundtruth number nn, we further conduct experiments on SIFT-1M and GIST-1M, varying nn in {10, 100, 1000}. In Figure 9 we compare the recall of (D-)ABQ using 128 bits with two state-of-the-art methods, ITQ and KMH, which serve as representatives of the projection-based and prototype-based methods and achieved the best performance among all baselines in the previous experiments.
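For clarity, the groundtruth in this experiment is assumed to be the nn exact nearest neighbors of each query in the original feature space under the Euclidean metric, the standard protocol for SIFT/GIST benchmarks; a brute-force construction along these lines (adequate at the 1M scale, though batching would be faster) is sketched below.

    import numpy as np

    def build_groundtruth(queries, database, nn=100):
        """Indices of the nn exact Euclidean nearest neighbors of each query
        (the groundtruth whose size is varied in Fig. 9)."""
        gt = np.empty((len(queries), nn), dtype=np.int64)
        for q, x in enumerate(queries):
            dists = ((database - x) ** 2).sum(axis=1)   # squared Euclidean distances
            gt[q] = np.argpartition(dists, nn)[:nn]     # the nn smallest, unordered
        return gt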

Fig. 9. Recall performance of different hashing methods with respect to different numbers of groundtruth neighbors on SIFT-1M and GIST-1M. (a) Recall on SIFT-1M. (b) Recall on GIST-1M.

As the figure shows, when more nearest neighbors are used as the groundtruth (nn varies from 10 to 1000), the recall of all methods decreases. This is because, as the distance between a database point and the query increases, the probability that they collide in the same bucket decreases. Nevertheless, under all settings our ABQ consistently achieves the best performance and significantly outperforms the others. For instance, with nn = 100 on GIST-1M, ABQ achieves much higher recall than ITQ and KMH, and even exceeds their recall at nn = 10. This shows that our method is highly robust for the nearest neighbor search task. Besides, all these experiments adopt the same parameter settings, which indicates that the proposed ABQ is practical without complex parameter tuning.

VI. CONCLUSION

Inspired by the observation that in prototype-based hashing there may exist a better coding solution that utilizes only a small subset of binary codes rather than the complete set, this paper proposed an adaptive binary quantization method that jointly pursues a set of prototypes in the original space and a subset of binary codes in the Hamming space. The prototypes and the codes are correspondingly associated and together define the hash function for small hash codes. To speed up training and support the distributed storage now common in clusters, we further equipped our method with a distributed learning framework. Our method enjoys fast training and the capability of generating long, high-quality hash codes. The significant performance gains over existing methods in our extensive experiments on several large datasets encourage us to further study effective coding for binary quantization.

REFERENCES

[1] J. He et al., “Mobile product search with bag of hash bits and boundary reranking,” in Proc. IEEE CVPR, Jun. 2012, pp. 3005–3012.

[2] X. Liu, J. He, and B. Lang, “Multiple feature kernel hashing for large-scale visual search,” Pattern Recognit., vol. 47, no. 2, pp. 748–757, Feb. 2014.

[3] J. Song, Y. Yang, X. Li, Z. Huang, and Y. Yang, “Robust hashing with local models for approximate similarity search,” IEEE Trans. Cybern., vol. 44, no. 7, pp. 1225–1236, Jul. 2014.

[4] Q. Wang, Z. Zhang, and L. Si, “Ranking preserving hashing for fast similarity search,” in Proc. IJCAI, 2015, pp. 3911–3917.

[5] L. Liu and L. Shao, “Sequential compact code learning for unsupervised image hashing,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 12, pp. 2526–2536, Dec. 2015.

[6] X. Liu, L. Huang, C. Deng, B. Lang, and D. Tao, “Query-adaptive hash code ranking for large-scale multi-view visual search,” IEEE Trans. Image Process., vol. 25, no. 10, pp. 4514–4524, Oct. 2016.

[7] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, “A survey on learning to hash,” IEEE Trans. Pattern Anal. Mach. Intell., to be published, doi: 10.1109/TPAMI.2017.2699960.

[8] P. Jain, S. Vijayanarasimhan, and K. Grauman, “Hashing hyperplane queries to near points with applications to large-scale active learning,” in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 928–936.

[9] W. Liu, J. Wang, Y. Mu, S. Kumar, and S.-F. Chang, “Compact hyperplane hashing with bilinear functions,” in Proc. ICML, 2012, pp. 1–2.

[10] Y. Mu, G. Hua, W. Fan, and S.-F. Chang, “Hash-SVM: Scalable kernel machines for large-scale visual classification,” in Proc. IEEE CVPR, Jun. 2014, pp. 979–986.

[11] X. Liu, X. Fan, C. Deng, Z. Li, H. Su, and D. Tao, “Multilinear hyperplane hashing,” in Proc. IEEE CVPR, Jun. 2016, pp. 1–9.

[12] X. Liu, J. He, C. Deng, and B. Lang, “Collaborative hashing,” in Proc. IEEE CVPR, Jun. 2014, pp. 2147–2154.

[13] P. Indyk and R. Motwani, “Approximate nearest neighbors: Towards removing the curse of dimensionality,” in Proc. ACM STOC, 1998, pp. 604–613.

[14] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proc. SCG, 2004, pp. 253–262.

[15] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, “Supervised hashing for image retrieval via image representation learning,” in Proc. AAAI, 2014, pp. 2156–2162.

[16] G. Lin, C. Shen, and A. V. D. Hengel, “Supervised hashing using graph cuts and boosted decision trees,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 11, pp. 2317–2331, Nov. 2015.

[17] H. Zhu, M. Long, J. Wang, and Y. Cao, “Deep hashing network for efficient similarity retrieval,” in Proc. AAAI, 2016, pp. 2415–2421.

[18] W.-C. Kang, W.-J. Li, and Z.-H. Zhou, “Column sampling based discrete supervised hashing,” in Proc. AAAI, 2016, pp. 2604–2623.

[19] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Proc. NIPS, 2008, pp. 1–8.

[20] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, “Hashing with graphs,” in Proc. ICML, 2011, pp. 1–8.

[21] F. Shen, C. Shen, Q. Shi, A. V. D. Hengel, Z. Tang, and H. T. Shen, “Hashing on nonlinear manifolds,” IEEE Trans. Image Process., vol. 24, no. 6, pp. 1839–1851, Jun. 2015.

[22] Y. Gong, S. Kumar, V. Verma, and S. Lazebnik, “Angular quantization-based binary codes for fast similarity search,” in Proc. NIPS, 2012, pp. 1205–1213.

[23] M. Norouzi and D. J. Fleet, “Cartesian K-means,” in Proc. IEEE CVPR, Jun. 2013, pp. 2938–2945.

[24] W. Liu, C. Mu, S. Kumar, and S.-F. Chang, “Discrete graph hashing,” in Proc. NIPS, 2014, pp. 3419–3427.

[25] D. Song, W. Liu, and D. A. Meyer, “Coordinate discrete optimization for efficient cross-view image retrieval,” in Proc. IJCAI, 2016, pp. 2018–2024.

[26] X. Liu, B. Du, C. Deng, M. Liu, and B. Lang, “Structure sensitive hashing with adaptive product quantization,” IEEE Trans. Cybern., vol. 46, no. 10, pp. 2252–2264, Oct. 2016.

[27] F. Yu, S. Kumar, Y. Gong, and S.-F. Chang, “Circulant binary embedding,” in Proc. ICML, 2014, pp. 946–954.

[28] Y. Mu, W. Liu, C. Deng, Z. Lv, and X. Gao, “Fast structural binary coding,” in Proc. IJCAI, 2016, pp. 1860–1866.

[29] X. Liu, Y. Mu, D. Zhang, B. Lang, and X. Li, “Large-scale unsupervised hashing with shared structure learning,” IEEE Trans. Cybern., vol. 45, no. 9, pp. 1811–1822, Sep. 2015.

[30] J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo, “Effective multiple feature hashing for large-scale near-duplicate video retrieval,” IEEE Trans. Multimedia, vol. 15, no. 8, pp. 1997–2008, Dec. 2013.

[31] L. Liu, M. Yu, and L. Shao, “Multiview alignment hashing for efficient image search,” IEEE Trans. Image Process., vol. 24, no. 3, pp. 956–966, Mar. 2015.

[32] Y. Wang, X. Lin, L. Wu, W. Zhang, and Q. Zhang, “LBMCH: Learning bridging mapping for cross-modal hashing,” in Proc. ACM SIGIR, 2015, pp. 999–1002.

[33] H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu, “Complementary hashing for approximate nearest neighbor search,” in Proc. IEEE ICCV, Nov. 2011, pp. 1631–1638.

[34] X. Liu, C. Deng, B. Lang, D. Tao, and X. Li, “Query-adaptive reciprocal hash tables for nearest neighbor search,” IEEE Trans. Image Process., vol. 25, no. 2, pp. 907–919, Feb. 2016.

[35] Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in Proc. IEEE CVPR, Jun. 2011, pp. 817–824.

[36] W. Kong and W.-J. Li, “Isotropic hashing,” in Proc. NIPS, 2012, pp. 1–8.

[37] Z. Jin et al., “Complementary projection hashing,” in Proc. IEEE ICCV, Jun. 2013, pp. 257–264.

[38] X. Li, G. Lin, C. Shen, A. van den Hengel, and A. Dick, “Learning hash functions using column generation,” in Proc. ICML, 2013, pp. 142–150.

[39] L.-K. Huang, Q. Yang, and W.-S. Zheng, “Online hashing,” in Proc. IJCAI, 2013, pp. 1422–1428.

[40] B. Wu, Q. Yang, W.-S. Zheng, Y. Wang, and J. Wang, “Quantized correlation hashing for fast cross-modal search,” in Proc. IJCAI, 2015, pp. 3946–3952.

[41] Q.-Y. Jiang and W.-J. Li, “Scalable graph hashing with feature transformation,” in Proc. IJCAI, 2015, pp. 2248–2254.

[42] X. Liu, J. He, and S. F. Chang, “Hash bit selection for nearest neighbor search,” IEEE Trans. Image Process., to be published, doi: 10.1109/TIP.2017.2695895.

[43] K. He, F. Wen, and J. Sun, “K-means hashing: An affinity-preserving quantization method for learning binary compact codes,” in Proc. IEEE CVPR, Jun. 2013, pp. 2938–2945.

[44] M. Raginsky and S. Lazebnik, “Locality-sensitive binary codes from shift-invariant kernels,” in Proc. NIPS, 2009, pp. 1–8.

[45] B. Kulis and T. Darrell, “Learning to hash with binary reconstructive embeddings,” in Proc. NIPS, 2009, pp. 1–8.

[46] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in Proc. IEEE ICCV, Oct. 2009, pp. 2130–2137.

[47] V. E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou, “Deep hashing for compact binary codes learning,” in Proc. IEEE CVPR, Jun. 2015, pp. 2475–2483.

[48] J. Lu, V. E. Liong, and J. Zhou, “Deep hashing for scalable image search,” IEEE Trans. Image Process., vol. 26, no. 5, pp. 2352–2367, May 2017.

[49] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon, “Spherical hashing,” in Proc. IEEE CVPR, Jun. 2012, pp. 2957–2964.

[50] Z. Li, X. Liu, J. Wu, and H. Su, “Adaptive binary quantization for fast nearest neighbor search,” in Proc. ECAI, 2016, pp. 64–72.

[51] D. Tao, X. Tang, X. Li, and X. Wu, “Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 7, pp. 1088–1099, Jul. 2006.

[52] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, Jan. 2011.

[53] T. Ge, K. He, Q. Ke, and J. Sun, “Optimized product quantization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 4, pp. 744–755, Apr. 2014.

[54] C. Leng, J. Wu, J. Cheng, X. Zhang, and H. Lu, “Hashing for distributed data,” in Proc. ICML, 2015, pp. 1642–1650.

[55] X. Zhang, F. X. Yu, R. Guo, S. Kumar, S. Wang, and S.-F. Chang, “Fast orthogonal projection based on Kronecker product,” in Proc. IEEE ICCV, Dec. 2015, pp. 2929–2937.

[56] A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1958–1970, Nov. 2008.

Xianglong Liu (M’12) received the B.S. and Ph.D. degrees in computer science from Beihang University, Beijing, in 2008 and 2014. From 2011 to 2012, he visited the Digital Video and Multimedia Lab, Columbia University, as a joint Ph.D. student. He is currently an Associate Professor with Beihang University. His research interests include machine learning, computer vision, and multimedia information retrieval.

Cheng Deng (M’10) received the B.E., M.S., and Ph.D. degrees in signal and information processing from Xidian University, Xi’an, China. He is currently a Full Professor with the School of Electronic Engineering, Xidian University. He is the author and co-author of more than 50 scientific articles at top venues, including IEEE TNNLS, TMM, TCYB, TSMC, TIP, ICCV, CVPR, IJCAI, and AAAI. His research interests include computer vision, multimedia processing and analysis, and information hiding.

Dacheng Tao (F’15) is currently a Professor of computer science with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology, Sydney. He mainly applies statistics and mathematics to data analytics problems, and his research interests spread across computer vision, data science, image processing, machine learning, and video surveillance. His research results have been expounded in one monograph and over 100 publications in prestigious journals and at prominent conferences, such as IEEE T-PAMI, T-NNLS, T-IP, JMLR, IJCV, NIPS, ICML, CVPR, ICCV, ECCV, AISTATS, ICDM, and ACM SIGKDD, with several best paper awards, such as the Best Theory/Algorithm Paper Runner Up Award at IEEE ICDM’07, the Best Student Paper Award at IEEE ICDM’13, and the 2014 ICDM 10-Year Highest-Impact Paper Award.