
LoSHa: A General Framework for Scalable Locality Sensitive Hashing

Jinfeng Li, James Cheng, Fan Yang, Yuzhen Huang, Yunjian Zhao, Xiao Yan, Ruihao Zhao
Department of Computer Science and Engineering
The Chinese University of Hong Kong
{jfli,jcheng,fyang,yzhuang,yjzhao,xyan,rhzhao}@cse.cuhk.edu.hk

ABSTRACT

Locality Sensitive Hashing (LSH) algorithms are widely adopted to index similar items in high dimensional space for approximate nearest neighbor search. As the volume of real-world datasets keeps growing, it has become necessary to develop distributed LSH solutions. Implementing a distributed LSH algorithm from scratch requires high development costs, thus most existing solutions are developed on general-purpose platforms such as Hadoop and Spark. However, we argue that these platforms are both hard to use for programming LSH algorithms and inefficient for LSH computation. We propose LoSHa, a distributed computing framework that reduces the development cost by designing a tailor-made, general programming interface and achieves high efficiency by exploring LSH-specific system implementation and optimizations. We show that many LSH algorithms can be easily expressed in LoSHa's API. We evaluate LoSHa and also compare with general-purpose platforms on the same LSH algorithms. Our results show that LoSHa's performance can be an order of magnitude faster, while the implementations on LoSHa are even more intuitive and require only a few lines of code.

1 INTRODUCTION

Nearest neighbor (NN) search, which finds items similar to a given query item, serves as the foundation of numerous applications such as entity resolution, de-duplication, sequence matching, recommendation, similar item retrieval, and event detection. However, many indexing methods for NN search become inefficient as data dimensionality increases, and eventually perform even worse than brute-force linear scans [20]. Unfortunately, many types of real-world data, including texts, images, audio, DNA sequences, etc., are expressed and processed as high dimensional vectors.

To find similar items in high dimensional space, recent solutions focus on finding approximate nearest neighbors (ANN), for which Locality Sensitive Hashing (LSH1) [13] is widely recognized as the most effective method [20, 31]. The idea of LSH is that, by

1 In this paper, LSH refers to a broad class of algorithms [14, 32], which will be discussed in Section 2.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SIGIR '17, August 07-11, 2017, Shinjuku, Tokyo, Japan
© 2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-5022-8/17/08. . . $15.00
https://doi.org/10.1145/3077136.3080800

using carefully designed hash functions, similar items are hashed to the same bucket with higher probability, so one only needs to check a small number of buckets for items similar to a query. LSH can be implemented on relational databases and has sub-linear querying time complexity regardless of the distribution of data/queries [31]. Due to its outstanding query performance and favorable characteristics, LSH algorithms have been extensively studied [7, 8, 12, 14, 19, 31, 32, 36]. There are also numerous applications of LSH in computer vision, machine learning, data mining and databases. Due to its significance, LSH is cited as one of the two pieces of "Breakthrough Research" (the other one being MapReduce) in the 50th Anniversary issue of Communications of the ACM.

As we will discuss in Section 2.2.1, LSH algorithms are either single-probing or multi-probing. Query processing with single-probing LSH is simple but requires hundreds of hash tables in order to achieve good accuracy [30], which results in high memory and CPU usage. LSH algorithms developed in the early days, such as MinHash [2], SimHash [3] and E2LSH [5], belong to this category. In recent years, multi-probing LSH algorithms were proposed to reduce the number of hash tables [7, 12, 20, 24, 29, 31, 36], but the tradeoff is that query processing also becomes more complicated.

LSH is usually used to handle large-scale data (e.g., billions of images [29] or tweets [30]), for which distributed LSH has become necessary (to address the resource limitation of a single machine). However, implementing distributed multi-probing LSH is challenging for the following two main reasons.

First, query processing with multi-probing LSH is complicated to distribute. In fact, existing multi-probing LSH algorithms mainly adopt external-memory implementations [7, 12, 20, 29, 31, 36], which however have limited query throughput and long delays due to the limited computing capacity of a single machine.

Second, there is a lack of a general programming framework for implementing LSH algorithms on existing general-purpose distributed frameworks such as Hadoop and Spark. While works such as Pregel [25] and PowerGraph [10] provide a general programming framework for graph processing [23], and parameter servers [15] do so for machine learning, there is no such framework for LSH implementations. For this reason, most distributed solutions [1, 18, 21, 22, 27, 28, 30] support single-probing LSH only, whose query processing is simple and easy to implement.

To address the above problems, we set out to design a general platform, called LoSHa, for LSH workloads with tailor-made abstractions and user-friendly APIs. LoSHa offers a simple query-answer programming paradigm (resembling map-reduce) so that users can express different types of LSH algorithms (especially multi-probing LSH) easily and concisely, while implementation details such as


defining complicated dataflow, choosing/implementing different operators (e.g., joins, de-duplication), workload distribution, and fault tolerance are all shielded from users.

We implemented LoSHa as a subsystem upon a general-purpose platform (i.e., Husky [16, 33, 34]) to demonstrate its feasibility. We explored LSH-specific data layout and system optimizations, and evaluated LoSHa on large real datasets, including 1 billion tweets and 1 billion image descriptors. Our results show that LoSHa is over an order of magnitude faster than existing LSH implementations [1, 28] on Spark and Hadoop, and can handle 10 times larger datasets with the same computing resources.

2 BACKGROUND

In this section, we give the background of LSH and highlight the challenges of making LSH applications distributed.

2.1 Notion and Notations of LSH

Indyk and Motwani [13] introduced the notion of LSH as follows.

Definition 2.1. Let B(q, r) denote a ball centered at a point q with radius r. A hash function family H = {h : R^d → U} is (r1, r2, p1, p2)-sensitive if, given any q, o ∈ R^d,

(1) if o ∈ B(q, r1), then Pr_H[h(o) = h(q)] ≥ p1;
(2) if o ∉ B(q, r2), then Pr_H[h(o) = h(q)] ≤ p2.

A useful LSH family should satisfy r1 < r2 and p1 > p2. For any d-dimensional point o ∈ R^d, a single hash function h ∈ H hashes o to a value v ∈ U. Two similar points with distance less than or equal to r1 have a high probability of at least p1 of being hashed to the same value, while two dissimilar points with distance larger than r2 have a low probability of at most p2 of being hashed to the same value. Some common LSH families include min-wise permutation for Jaccard distance [2], random projection for angular distance [3], and 2-stable projection for Euclidean distance [5], also known as MinHash, SimHash, and E2LSH, respectively.

Intuitively, a hash value v ∈ U represents a bucket that contains similar items (i.e., points). We denote the probability of two items being hashed to the same value as the to-same-bucket probability. To improve the accuracy of LSH, multiple hash tables are generated, and multiple hash values are concatenated and used as a whole for each hash table. Specifically, L ≥ 1 hash tables, i.e., G1, G2, ..., GL, are generated. For each hash table, k ≥ 1 randomly chosen hash functions, (h1, h2, ..., hk), are used, and the k hash values of an item o are concatenated to form a signature, G(o) = (h1(o), h2(o), ..., hk(o)). In other words, L signatures, i.e., G1(o), G2(o), ..., GL(o), are obtained for the item o. The purpose of using k hash functions and L hash tables is to enlarge the gap in to-same-bucket probability between similar items and dissimilar items [27].
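For concreteness, the following minimal sketch (plain C++, not LoSHa's API; all names are illustrative) assembles the L signatures of an item from L groups of k hash functions:

#include <functional>
#include <vector>

using Point = std::vector<float>;
using HashFn = std::function<int(const Point&)>;   // one locality-sensitive hash function
using Signature = std::vector<int>;                 // k concatenated hash values

// Compute G_1(o), ..., G_L(o), where tables[i] holds the k hash
// functions of the (i+1)-th hash table.
std::vector<Signature> computeSignatures(
        const Point& o,
        const std::vector<std::vector<HashFn>>& tables) {
    std::vector<Signature> sigs;
    sigs.reserve(tables.size());                    // L tables
    for (const auto& hashFns : tables) {
        Signature g;
        g.reserve(hashFns.size());                  // k hash values per table
        for (const HashFn& h : hashFns)
            g.push_back(h(o));                      // G_i(o) = (h_1(o), ..., h_k(o))
        sigs.push_back(std::move(g));
    }
    return sigs;
}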

LSH is typically conducted in two phases. The first phase is pre-processing, which inserts data items into the hash tables. LSH projects each d-dimensional data item into a k-dimensional vector, i.e., a k-dimensional signature Gi(o), and inserts the item into the corresponding bucket in hash table Gi, for 1 ≤ i ≤ L, where Gi(o) is the bucket ID. The second phase is querying. Given a query item q, we first obtain L bucket IDs for q in the same way as for a data item. Then, data items in the same bucket as q (called candidate

answers) in each hash table are evaluated to obtain the final answer (e.g., items within a given distance from q). LSH achieves sub-linear querying complexity [13, 31] (as only candidate answers are evaluated) and hence is attractive in practice.

2.2 The State-of-the-art LSH

We first categorize existing LSH algorithms and then analyze the difficulties of implementing them on existing distributed frameworks.

2.2.1 LSH Algorithms. Querying by LSH was originally designed to probe a single bucket (i.e., the bucket q is hashed to) in each of the L hash tables. We name this querying method single-probing LSH; typical examples include MinHash [2], SimHash [3] and E2LSH [5]. Single-probing LSH is simple [1, 18, 21, 22, 27, 28], but it suffers from two weaknesses. First, single-probing LSH is sensitive to r1 and r2, but the distance from a query to its nearest neighbor could be any number in practice. To guarantee a high recall (i.e., most of the query answers are found in the set of candidate answers), a large number of hash tables have to be generated [24, 31], which consumes a large amount of memory and incurs high de-duplication overhead to remove duplicate candidate answers from the L buckets. For example, 780 hash tables are created in PLSH [30]. Second, single-probing LSH has poor recall if k is large (e.g., 64 or 128) [14, 19]. Thus, it is not applicable to most algorithms [14, 32] that learn hash functions from data and map each item to a long hash code for better precision [19].

To address the weaknesses of single-probing LSH, many multi-probing LSH algorithms have been proposed [7, 12, 20, 24, 29, 31, 36]. The main idea is that, if similar items are not in the same bucket as q, they are likely in other buckets of the hash table that are "nearby" q's bucket, because the hashing is locality sensitive. If we can compute new signatures corresponding to these "nearby" buckets, we can probe multiple buckets in the same hash table to improve the recall. As a result, multi-probing significantly reduces the number of hash tables [7, 12, 20, 24, 36] and effectively improves the recall when k is large [19, 32].

2.2.2 Distributed LSH. Although there are many LSH algorithms, none dominates the others in all aspects (e.g., precision, recall and efficiency) [14, 26, 32]. One common problem they share is the handling of large-scale data (e.g., billion-scale tweets or images), for which the typical solution is distributed computing. To reduce development cost, general-purpose distributed frameworks are normally used, and most distributed LSH solutions were developed on Hadoop and Spark. However, these distributed solutions are mostly single-probing LSH [1, 18, 21, 22, 27, 28] and suffer from the problems presented in Section 2.2.1. In addition, it is also challenging to implement distributed multi-probing LSH on Hadoop and Spark, due to the following two main factors.

First, to program multi-probing LSH, users need to define a complicated dataflow consisting of many steps. Each step involves multiple operations, and users have to carefully select suitable operators (e.g., which join operator to use) and consider many other implementation details (e.g., how to avoid unnecessary re-computation by caching data on Spark). It is easy to make a wrong choice in some steps, leading to an inefficient or even incorrect


Figure 1: Query processing in LoSHa. Queries (q1, q2, q3) call query and are mapped to buckets (b1, b2), which forward the messages to their items (i1, i2, i3); the items call answer to reply to the queries.

implementation, as well as much debugging and performance tuning effort. To address these problems, we design higher-level abstractions and user-friendly APIs in Section 3, and demonstrate with examples in Section 4 how to implement multi-probing LSH algorithms.

Second, existing general-purpose platforms are not optimized for LSH. For example, in the LSH implementations on these platforms, queries and items are joined on signatures, which means that L copies of items and queries need to be shuffled and sent over the network, leading to high network traffic and enormous memory consumption. Furthermore, the execution strategies of their operators (e.g., join) are not the best fit for LSH workloads. To obtain efficient implementations, users need to do the optimizations by themselves. In Section 5, we explore LSH-specific optimizations so that users are off-loaded from these difficult tasks.

3 LOSHA PROGRAMMING FRAMEWORK

A practical distributed computing framework must provide an easy-to-use programming interface in order to lower the development cost. One well-known example is MapReduce [6], which offers a simple programming interface for easy implementation of many distributed data processing algorithms. Similarly, for scalable LSH computation, the most important first step is to design a unified programming model and a simple API. However, this is challenging because there are hundreds of LSH algorithms and many of them are complicated to implement, let alone parallelize. We present the details of the LoSHa programming model in this section.

3.1 Overall Framework

LoSHa uses three data abstractions, Query, Bucket, and Item, to model input queries, buckets, and items, respectively. LoSHa first pre-processes data items to create a set of Item objects. It also creates a set of Bucket objects, where each Bucket object has a bucket ID, bId, and all Item objects having a signature equal to bId belong to this Bucket object. Queries are processed similarly to items to create a set of Query objects.

Query processing in LoSHa requires users to specify the querying logic in two user-defined functions, query and answer. Figure 1 shows the flow of query processing, described as follows. LoSHa processes LSH queries iteratively. In each iteration, each Query object calls the query function to find a set of potential buckets that contain candidate items, and sends a QueryMsg message to each potential bucket, where QueryMsg contains the content of the query (and any other information required to evaluate the query). When a Bucket object receives a QueryMsg message, it forwards the message to its Item objects. When an Item object receives a QueryMsg message, it calls the answer function to evaluate the query (e.g.,

Listing 1 Query API in LoSHa

template<typename Id,
         typename Content,
         typename QueryMsg = IdContent<Id, Content>,
         typename AnswerMsg = std::pair<Id, float>>
class Query : public IdContent<Id, Content> {
public:
    virtual void query(Factory<Id, Content>& fty);
    const IdContent<Id, Content>& getQuery();
    const std::vector<AnswerMsg>& getAnswerMsg();
    void sendToBucket(Sig& bId, QueryMsg& msg);
    void setFinished();
};

calculate the distance between the query and itself), and sends the results (as an AnswerMsg message) back to the Query object. Then in the next iteration, each Query object decides whether the query evaluation is completed or should continue, based on the results received from the candidate Item objects.

Iterative processing allows multi-probing LSH to be easily implemented; for conventional single-probing LSH, we simply run just one iteration. LoSHa hides the complicated dataflow in the iterative process from users. The underlying system also optimizes key operations such as join and de-duplication, and automatically handles workload distribution and fault tolerance.

3.2 Application Programming Interface

In LoSHa, Query, Bucket and Item objects and their related operations are implemented by the respective Query, Bucket and Item classes in LoSHa's API. To program with LoSHa, users only need to subclass the Query and Item classes, and overload the query and answer functions.

The APIs for Query and Item are shown in Listings 1 and 2. Users specify four template arguments, Id, Content, QueryMsg and AnswerMsg. LoSHa expresses a query or an item as an (Id, Content) pair, denoted by IdContent, where Id is the type of the ID of the query/item and is used to locate the query/item, and Content is the type of the content of the query/item. LoSHa stores Content in a vector. The i-th element can be either the value of the i-th feature of the query/item (i.e., dense vector), or a (pos, v) pair representing the value v of the pos-th feature (i.e., sparse vector). The messages sent out by a query and an item are of type QueryMsg and AnswerMsg, respectively. For example, QueryMsg can be the query itself, i.e., IdContent, or just the query ID. An AnswerMsg message sent by an item is usually an (Id, dist) pair, where Id is the item ID and dist is the distance between the query and the item. A Query or Item object may also contain user-defined fields to carry any information useful for query processing.

Query API. Users implement the query function to specify the operations to be performed by a Query object (see an example in Listing 3). The Query class provides a number of built-in functions, and the query function may use getQuery to obtain a query's IdContent, use getAnswerMsg to collect messages sent by candidate items, and use setFinished to finish the evaluation of a query when the query predicate is satisfied. Signatures of a query


Listing 2 Item API in LoSHa

template<typename Id,
         typename Content,
         typename QueryMsg = IdContent<Id, Content>,
         typename AnswerMsg = std::pair<Id, float>>
class Item : public IdContent<Id, Content> {
public:
    virtual void answer(Factory<Id, Content>& fty);
    const IdContent<Id, Content>& getItem();
    const std::vector<QueryMsg>& getQueryMsg();
    void sendToQuery(Id& qId, AnswerMsg& msg);
};

Listing 3 Single-probing LSH in LoSHa

class LSHQuery : public Query<int, float> {
public:
    void query(Factory<int, float>& fty) {
        // obtain buckets (i.e., signatures) to probe
        for (auto& bId : fty.calSigs(*this))
            sendToBucket(bId, this->getQuery());
    }
};

class LSHItem : public Item<int, float> {
public:
    void answer(Factory<int, float>& fty) {
        // get messages to evaluate
        for (auto& q : getQueryMsg()) {
            float dist = fty.calDist(q, *this);
            // output qualified items
        }
    }
};

can be calculated by functions provided in the Factory API (to be presented shortly). Each signature corresponds to one bucket ID, which is of type Sig (a k-dimensional vector in LoSHa by default), and sendToBucket uses the ID to send messages to the bucket.

Item API. Users may use getQueryMsg in the Item class to collect messages forwarded to the item by Bucket objects, and then specify in the answer function how to process the messages to evaluate the query (e.g., calculate the distance between the item and the query). The results are then sent via an AnswerMsg message to the corresponding Query object by sendToQuery.

Factory API. Many types of metric spaces are defined for different LSH applications. A different metric space requires a different distance calculation and signature computation. The Factory API allows three user-defined functions to be overloaded. Users can overload the calDist function to calculate the distance between two points in the given metric space, and the calSigs function to compute the signatures for a given point. Some LSH families (e.g., E2LSH [5]) first project an item to a real value and then round it to an integer to obtain the final hash value for the item. Rounding loses proximity information, which is useful for designing some multi-probing LSH strategies. Thus, we provide another function, calProjs, for users to obtain the un-rounded projected real values of a query or an item.

LoSHa supports a Factory library which includes many common metrics such as Hamming distance, Jaccard distance, Euclidean distance, angular distance, etc. For a wide range of LSH algorithms, users may simply call functions for their applications using any of these metrics, and focus on the design of the querying logic. Users may also implement their own Factory functions for a new metric.
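To illustrate what such Factory functions wrap, the standalone sketch below (not LoSHa's code; names and signatures are assumptions) shows the two computations a Factory for angular distance, as used by SimHash, would provide through calDist and calSigs:

#include <cmath>
#include <vector>

// Angular distance between two dense vectors, as such a Factory's
// calDist would compute it.
float angularDist(const std::vector<float>& a, const std::vector<float>& b) {
    float dot = 0, na = 0, nb = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return std::acos(dot / (std::sqrt(na) * std::sqrt(nb)));
}

// One SimHash signature: k sign bits of random projections, as such a
// Factory's calSigs would compute for one hash table.
std::vector<int> simhashSignature(const std::vector<float>& p,
                                  const std::vector<std::vector<float>>& projections) {
    std::vector<int> sig;
    sig.reserve(projections.size());          // k projection vectors
    for (const auto& r : projections) {
        float dot = 0;
        for (size_t i = 0; i < p.size(); ++i) dot += p[i] * r[i];
        sig.push_back(dot >= 0 ? 1 : 0);      // sign bit of the projection
    }
    return sig;
}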

An example. We illustrate how to program with LoSHa's API by implementing a single-probing LSH algorithm, as shown in Listing 3. The query function calls calSigs of the Factory class to compute the signatures of a query, i.e., the IDs of the potential buckets to be probed, and then calls sendToBucket to send the QueryMsg message to each potential bucket. The QueryMsg message is just the query itself in this application.

The answer function simply checks each QueryMsg message forwarded to an item by its bucket, and calls the calDist function of the Factory class to calculate the distance between the item and the query in QueryMsg. The qualified answers are then written to the output.

The single-probing LSH implementation is simple. For multi-probing LSH, we only need to specify the multi-probing strategy in the query function, which we discuss in the following subsection, while in the answer function we simply use sendToQuery to send the results back to the query (instead of writing them to the output).

3.3 Multi-probing LSH Strategies

Many multi-probing LSH algorithms have been proposed for various application scenarios [7, 20, 29, 31]. In order to provide a general programming framework for LSH implementations, we identify common patterns in these algorithms. Specifically, in the first iteration, we probe buckets as in single-probing. If the query predicate is not satisfied (e.g., fewer than K similar items are found), new signatures (i.e., new bucket IDs) are generated according to a multi-probing strategy, and we continue probing these new potential buckets in the next iteration. To demonstrate that our framework supports multiple probings well, we adopt two strategies in the experiments and present them below.

Candidate expansion using similar items. The main idea of this strategy is that similar items are more likely to be clustered together. Assume o is a similar item of q returned in the current iteration from the l-th hash table, where 1 ≤ l ≤ L. For each such o, the query function computes a set of potential buckets (for the next iteration) with IDs equal to G1(o), ..., G(l-1)(o), G(l+1)(o), ..., GL(o). Since o is returned from the bucket with ID Gl(o) in the l-th hash table, we do not probe this bucket again to avoid duplicate probing. This strategy is good for radius-based search (i.e., return neighbors within a specific distance), as shown in Section 6.2.
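A minimal sketch of this expansion step, assuming the query-side code already knows each returned candidate's L signatures and the index of the table it came from (helper names are hypothetical, not LoSHa's API):

#include <vector>

using Signature = std::vector<int>;

// Given a candidate o returned from hash table `fromTable` (0-based),
// produce the bucket IDs of o in all other tables for the next iteration.
std::vector<Signature> expandCandidate(const std::vector<Signature>& sigsOfO,  // G_1(o), ..., G_L(o)
                                       size_t fromTable) {
    std::vector<Signature> nextBuckets;
    for (size_t l = 0; l < sigsOfO.size(); ++l) {
        if (l == fromTable) continue;      // skip the bucket o was already found in
        nextBuckets.push_back(sigsOfO[l]); // probe o's bucket in every other table
    }
    return nextBuckets;
}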

Adjusted signatures. Many LSH algorithms [19, 20, 24] first hash a point to a real value, and then truncate it to an integer. The truncation may lose some proximity information between two points. For example, an E2LSH function [5] may hash two points q and o to the values 1.001 and 0.999, which are then truncated to 1 and 0, respectively. Although the difference between q and o is only 0.002, it is enlarged by the truncation to 1. Such loss can be remedied by adjusting the hash value of q from 1 to 0, and this idea has been used to design multi-probing LSH strategies in [24]. We implement the idea in LoSHa as follows. Let ri(q) be the real value that is truncated to give hi(q), where 1 ≤ i ≤ k. We first calculate di(q) = ri(q) - hi(q), sort the di(q) for 1 ≤ i ≤ k in ascending order of their values, and keep the sorted list in q's data field. Then, in the i-th iteration after the first iteration, if the query predicate is


Listing 4 Bucket API in LoSHa

template<typename Id,
         typename Content,
         typename QueryMsg,
         typename AnswerMsg>
class Bucket {
public:
    virtual void mapping(Factory<Id, Content>& fty);
    const std::vector<QueryMsg>& getQueryMsg();
    void sendToQuery(Id& qId, AnswerMsg& msg);
};

not satisfied, let dj(q) be the i-th value in the sorted list; the query function then computes a new potential bucket ID by adjusting the j-th hash value in q's signature. This strategy significantly improves the quality of the returned answers for k-nearest-neighbor search, as reported in Section 6.4.
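The following simplified sketch (illustrative only; the exact adjustment rule in LoSHa may differ, and all names are assumptions) shows one way to keep the sorted truncation losses per query and derive the adjusted bucket ID in iteration i:

#include <algorithm>
#include <vector>

// Per-query state for the adjusted-signature strategy.
struct AdjustState {
    std::vector<int>    sig;    // h_1(q), ..., h_k(q) after truncation
    std::vector<float>  proj;   // r_1(q), ..., r_k(q) before truncation
    std::vector<size_t> order;  // hash positions sorted by d_i(q) = r_i(q) - h_i(q)
};

void initAdjustState(AdjustState& s) {
    s.order.resize(s.sig.size());
    for (size_t i = 0; i < s.order.size(); ++i) s.order[i] = i;
    std::sort(s.order.begin(), s.order.end(), [&](size_t a, size_t b) {
        return (s.proj[a] - s.sig[a]) < (s.proj[b] - s.sig[b]);   // ascending d_i(q)
    });
}

// In iteration i (i >= 1), adjust the hash value with the i-th smallest
// truncation loss, probing the adjacent bucket below it (as in the paper's
// example of adjusting 1 to 0); a fuller strategy could also probe upward.
std::vector<int> nextBucketId(const AdjustState& s, size_t i) {
    std::vector<int> bId = s.sig;
    size_t j = s.order[i - 1];
    bId[j] -= 1;
    return bId;
}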

3.4 Support for Non-Querying Applications

Some applications of LSH do not require a querying process. For example, for the overlapping clustering applied in Google News [4], we can implement it in MapReduce by first computing the signatures of the items in mappers and then distributing the items to reducers based on their L signatures. Each reducer then simply outputs the items distributed to it, labeling them as belonging to the same cluster. Although LoSHa is a framework designed for query processing, we show that LoSHa can also handle such a non-querying workload as easily as MapReduce, using the Bucket API given in Listing 4.

We consider each item as a query, so that Query acts as the Mapper and the query function acts as the map function, while the Item class is simply not used here. We also subclass the Bucket class, which acts as the Reducer, and whose mapping function acts as the reduce function. In addition, our framework also supports efficient implementation of iterative algorithms, which are expensive when implemented on the MapReduce framework.

4 APPLICATIONS

In this section, we illustrate how to implement two advanced multi-probing LSH algorithms in LoSHa's framework.

4.1 Multi-Probing PLSH in LoSHa

PLSH [30] is the fastest distributed LSH implementation for angular distance, but it is single-probing LSH and requires a large number of hash tables. We implement the LSH algorithm of PLSH with a multi-probing strategy to reduce the number of hash tables. We denote our implementation as multi-probing PLSH (MPLSH).

Implementing the query function for multi-probing LSH consists of three main steps: (1) collecting candidates from the previous iteration, (2) deciding whether to finish querying, and (3) computing a new set of potential bucket IDs for the next iteration. In Listing 5, the query function performs these three steps as follows. First, it collects the candidate items sent from the previous iteration, if any, into a container cands. Then, the candidate items in cands are used to check whether the query predicate is satisfied (e.g., the top K answers have been found), and to decide whether to finish or continue the query evaluation (note that users should output the query results

Listing 5 Multi-probing PLSH in LoSHa

class MPLSHQuery : public Query<int, Content, QueryMsg, int> {
public:
    virtual void query(Factory<int, Content>& fty) {
        // collect candidates into cands
        for (auto& item : getAnswerMsg())
            cands.insert(item);
        auto finished = ...  // decide whether to stop
        if (finished) setFinished();
        // calculate new buckets (signatures) to probe
        auto newSigs = ...   // calculate new signatures
        for (auto& bId : newSigs)
            sendToBucket(bId, this->getQuery());
    }
};

class MPLSHItem : public Item<int, Content, QueryMsg, int> {
public:
    void answer(Factory<int, Content>& fty) {
        for (auto& q : getQueryMsg()) {
            float dist = fty.calDist(q, *this);
            if (dist <= 0.9) sendToQuery(q.id, this->id);
        }
    }
};

before calling setFinished). If the query predicate is not satisfied yet, a new set of signatures is calculated (e.g., using a probing strategy described in Section 3.3) and sendToBucket is called to initiate probing of the potential buckets in the next iteration.

The answer function is much simpler. We only need to calculate the distance between a candidate item and a query using a standard angular distance calculation function in the Factory class, and call sendToQuery to send the qualified candidates (e.g., items having an angle smaller than 0.9 from the query) to the query.

4.2 Parallelizing C2LSH with LoSHa

The problem of c-ANN [7] is to return a data item o for a query q such that d(o, q) ≤ c · d(o*, q), where o* is the exact NN of q, and d(o, q) and d(o*, q) denote the distances between o and q and between o* and q, respectively. Many multi-probing LSH solutions have been proposed for this problem, but most are external-memory algorithms [7, 20, 29, 31, 36]. With our framework, we can easily implement a parallel version of these multi-probing LSH algorithms, which utilizes the aggregate memory of a cluster instead of resorting to the external memory of a single machine. We illustrate this by converting C2LSH [7], one of the most efficient single-core multi-probing external-memory LSH algorithms, into a distributed LSH algorithm with LoSHa.

C2LSH works in Euclidean distance with E2LSH [5] as the hash function family. Only one hash function is used to compute each signature, i.e., k = 1, and hence we simply write h(·) for G(·). The query function probes bucket h(q) in the first iteration, and then in the (i + 1)-th iteration (for i > 0), buckets with ID bId satisfying ⌊bId/c^i⌋ = ⌊h(q)/c^i⌋ are probed. In the i-th iteration, the query function calls setFinished if it has collected more than K candidates whose distance to q is less than or equal to c^i, or if more than (K + βn) candidates have been returned, where β is a user-defined percentage of allowed false positives and n is the total number of items. In C2LSH, an item o is considered a candidate only if it collides with q in more than θ probed buckets. Thus, in the answer function, the collision count of an item with each query


is maintained (as a data field of the item). If the count is larger than θ for a query q, the item will be sent to q.
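To make the two C2LSH-specific pieces concrete, the sketch below (illustrative only, not the LoSHa implementation; hash values are assumed non-negative) shows the virtual-bucket probing test and the item-side collision counter:

#include <cmath>
#include <unordered_map>

// In iteration `iter`, a bucket bId is probed iff it lies in the same
// virtual bucket of width c^iter as h(q), i.e. floor(bId/c^iter) == floor(h(q)/c^iter).
bool shouldProbe(long bId, long hq, double c, int iter) {
    long width = static_cast<long>(std::pow(c, iter));
    return (bId / width) == (hq / width);
}

// Collision counting on the item side: an item becomes a candidate for
// query q only after colliding with q in more than theta probed buckets.
struct CollisionCounter {
    std::unordered_map<int, int> count;     // query ID -> collisions with this item
    bool addCollision(int queryId, int theta) {
        return ++count[queryId] > theta;    // true: send this item to the query
    }
};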

Readers may refer to [7] for the theoretical guarantees. We remark that the implementation of C2LSH on LoSHa takes about 200 lines of code, while the code of [7] has over 3,000 lines in order to handle large datasets.

5 SYSTEM IMPLEMENTATION

We now present the implementation details of LoSHa.

5.1 System Architecture

LoSHa is designed to run on a shared-nothing cluster with a typical master-worker architecture. The master mainly performs progress checking and coordination of the workers. Each machine in the cluster may run multiple workers, which perform the LSH computation on the data.

A LoSHa program proceeds in iterations, where each iteration consists of three steps: query, mapping, and answer. Each worker is run by one thread. While all workers run in parallel, each worker executes in sequence the query, mapping, and answer functions for the sets of queries, buckets, and items that are distributed to it. Workers communicate with each other by message passing, and each worker maintains W message buffers, where W is the number of workers in the cluster. Messages to be sent to workers in the same machine can be passed by reference, without actual payload copying or network communication.

5.2 Data Layout

Data items and queries are stored as Item and Query objects in LoSHa. LoSHa generates L signatures for each item, using the calSigs function of the Factory class. Then, a Bucket object is created for each distinct signature (which is the ID of the bucket), and for all Item objects that have this signature, their item IDs are stored with this Bucket object. The Item, Bucket and Query objects are partitioned and distributed among workers by random hashing on their IDs.

LSH computation often requires delivering messages to the receiving objects by their IDs. This can be implemented by a hash join, but that incurs a higher overhead due to random access. We pre-sort the objects in each worker by their IDs, and then sort messages by their target object IDs (if not sorted already), so that the join requires only a linear scan. This allows better data locality and prefetching to improve performance. However, new queries keep coming in and there may also be new items inserted. To avoid the high cost of maintaining the pre-sorted order of the objects, we assign the ID of each new Item/Query object in increasing order as follows: worker i assigns the j-th new object the ID (i + j * W). In this way, we can simply append the new object to the end of the pre-sorted array without re-ordering.
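The ID assignment rule amounts to a one-line helper; the sketch below (illustrative, not LoSHa's code) shows why the IDs assigned by each worker are strictly increasing, so appending preserves the pre-sorted order:

// Append-only ID assignment: worker `workerId` gives its j-th new object
// the ID workerId + j * W. Per worker, IDs grow by W each time, so new
// objects can be appended to the sorted array without re-ordering.
long nextObjectId(int workerId, long numAssignedSoFar, int numWorkers) {
    return workerId + numAssignedSoFar * static_cast<long>(numWorkers);
}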

LoSHa's data layout also allows it to implement an efficient two-level join that maps queries first to potential buckets and then to candidate items. In contrast, existing LSH implementations in the MapReduce framework require expensive joins on queries and items, which have higher CPU and memory usage, and also higher data transmission cost among the workers, as verified by our experiments.

Figure 2: Bucket compression. Buckets with the same signature (e.g., <0,1>) in hash tables G1 and G2 are merged into a single table G; the item IDs in each merged bucket are then grouped by the worker that stores the item and sorted within each group.

5.3 Bucket Compression

Each item generates L signatures, and each signature corresponds to a bucket in a hash table. Suppose that each hash table has B distinct signatures on average. This requires O(LBk) space for keeping the bucket objects, and thus the memory consumption is high, because both L and B are large. For example, PLSH [30] uses L = 780 hash tables and B = O(2^k) where k = 16. Note that PLSH is designed for SimHash, where each of the k hash values in a signature is boolean, and B can be much larger for other LSH families [5, 7, 12, 24, 29, 31] where each hash value is an integer.

We found that we can actually reduce the O(LBk) space to O(Bk) space by compressing the buckets with the same signature across the L hash tables. Such compression is effective based on our observation that the size of the union of the sets of signatures of the L hash tables is cB, where c is usually a small constant. Thus, we can use just one hash table with cB distinct signatures, and keep all the item IDs with the same signature across the L hash tables in one single location. As illustrated in Figure 2, we compress the buckets with the same signature (i.e., <0, 1>) in the two hash tables G1 and G2 into a single bucket in G. We can then apply further optimizations as follows.

For each list of item IDs kept in a bucket, we first group them by the workers where the real items are stored, and then sort each group. For example, if we have two workers and an item with ID x is stored in worker x%2, then the item IDs in Figure 2 are divided into two groups (i.e., even and odd) and then sorted. This new bucket structure has the following advantages. First, messages sent to multiple Bucket objects with the same signature are now sent to a single Bucket object. Second, to forward messages to items, we do not need to locate the items by hashing their IDs one by one, as the items are already grouped by their worker IDs and so we can simply forward the messages to their worker. Third, when a worker receives the messages (from multiple workers), we only need to merge them by the target item IDs, which are already sorted (thus avoiding sorting during query processing).
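A minimal sketch of this per-bucket layout, assuming an item with ID x is stored on worker x % W as in the example above (the function is illustrative, not LoSHa's code):

#include <algorithm>
#include <vector>

// Group a bucket's item IDs by the worker that stores each item, then sort
// within each group; groups[w] can be forwarded to worker w as a whole,
// and the receiving worker merges already-sorted lists.
std::vector<std::vector<long>> groupAndSort(const std::vector<long>& itemIds, int W) {
    std::vector<std::vector<long>> groups(W);
    for (long id : itemIds) groups[id % W].push_back(id);
    for (auto& g : groups) std::sort(g.begin(), g.end());
    return groups;
}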

5.4 Message Reduction

One major source of messages is the QueryMsg messages sent from each Query object to its potential buckets and then forwarded to all the items in each potential bucket. Since the answer function may calculate the distance between a query and a candidate item, the QueryMsg message should contain the query itself (i.e., a high dimensional vector), and thus sending QueryMsg messages many times over the network is expensive. To address this problem, we broadcast the query (together with other information needed for its evaluation) to each machine in the cluster, and send only query IDs in QueryMsg. Thus, when a worker executes the answer function

6

Page 7: LoSHa: A General Framework for Scalable Locality …jcheng/papers/LoSHa_sigir17.pdflutions. Implementing a distributed LSH algorithm from scratch requires high development costs, thus

on an item, it simply retrieves the corresponding query from the local machine. This significantly reduces the communication cost, especially for multi-probing LSH.

Sending query IDs in QueryMsg and item IDs in AnswerMsg can still be costly, since a large number of messages can be created in each step of an iteration. The combiner is a standard technique used to reduce the number of messages [6, 25]. To allow more message reduction, we implement a machine-level combiner. For the same QueryMsg message that is sent to multiple buckets in multiple workers, if these workers are in the same physical machine, LoSHa only sends a single QueryMsg message (together with the set of bucket IDs) to the machine. After reaching the remote machine, a thread delivers the QueryMsg message to the corresponding buckets in each worker of that machine. In addition, all messages sent to the same machine are grouped together and sent in batches to make better use of network bandwidth.

In the mapping step of each iteration, however, we implement a two-level combiner, because up to L copies of the same QueryMsg message can be forwarded to the same item, i.e., the item appears in all the L buckets to which the QueryMsg message is sent. We first combine the same QueryMsg messages sent to the same item at the worker level, and then we further combine the messages for the same item at the machine level. Finally, we combine the same QueryMsg messages that are sent to multiple items in multiple workers in the same target machine.
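The core of each combining level is removing duplicate (item, query) pairs before forwarding; a rough sketch of one such pass, under the simplifying assumption that messages are reduced to ID pairs (not how LoSHa represents them internally):

#include <set>
#include <utility>
#include <vector>

// One combining pass: drop duplicate copies of the same query targeted at
// the same item. The same reduction is applied first per worker and then
// again per machine.
std::vector<std::pair<long, int>> combinePass(
        const std::vector<std::pair<long, int>>& itemQueryPairs) {  // (item ID, query ID)
    std::set<std::pair<long, int>> seen(itemQueryPairs.begin(), itemQueryPairs.end());
    return std::vector<std::pair<long, int>>(seen.begin(), seen.end());
}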

5.5 De-duplication

De-duplication is one of the key issues in LSH implementations, as the same query can be mapped to the same item up to L times. In addition, after the query has been evaluated against an item in an iteration, multi-probing may generate new potential bucket IDs in subsequent iterations such that the same item may appear again in some of these new potential buckets, meaning that the query would be evaluated against the item again if de-duplication is not performed.

De-duplication is mainly done by using a hash set, a tree index, or a bit vector [30]. In LoSHa, most duplicate copies of the same QueryMsg message sent to the same item have already been eliminated by the two-level combiner described in Section 5.4, and the number of copies of the same QueryMsg message sent to the item from different machines is usually very small. Thus, it is efficient to simply use a hash set for de-duplication.

LoSHa also differs from existing work in that we take a flexible approach to deciding where de-duplication should be handled. The evaluation of a query against an item (e.g., distance calculation) can be expensive for some applications (e.g., where each data item is a dense high dimensional vector), while it is cheap for others. If the evaluation is expensive, then de-duplication is performed by the Item object: the item inserts the received QueryMsg messages into a hash set, which removes all duplicate copies. However, keeping a hash set for each item may consume much memory and incur the extra cost of deleting old queries. Thus, if the evaluation is not expensive, de-duplication is better performed by the Query object: we allow duplicate copies of a query to be evaluated against an item, but insert the AnswerMsg messages received by the Query object into a hash set, so that duplicate qualified items are removed and not included in the final answer.
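The query-side variant reduces to a small amount of per-query state; a minimal sketch (hypothetical names, not LoSHa's data structures):

#include <unordered_set>

// Query-side de-duplication: keep only the first AnswerMsg for each item ID,
// so duplicate qualified items are dropped from the final answer.
struct QueryAnswers {
    std::unordered_set<int> seenItems;
    bool accept(int itemId) { return seenItems.insert(itemId).second; }
};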

Table 1: Datasets

Dataset     Item#           Dim#      Entry#
SIFT1B      1,000,000,000   128       128
Tweets1B    1,050,000,000   500,000   avg 7.5
Glove2.2M   2,196,017       300       300

5.6 Online Processing and Fault Handling

LoSHa handles online queries and fault tolerance as follows.

Online queries. LoSHa is able to accept queries in real time. New queries are read into a buffer in LoSHa. A greedy strategy can be used here such that each worker processes approximately the same number of queries. We adopt the mini-batch model as in PLSH [30], where a worker buffers new queries and starts their evaluation in the next iteration. Thus, new queries do not have to wait until all existing queries finish their evaluation.

Fault tolerance. Fault tolerance in LoSHa is achieved by writing a copy of the item and bucket layout in each machine to HDFS. We usually perform infrequent batch updates, and hence we can re-write the entire copy of the item and bucket layout to HDFS each time. If the updates are frequent, we may apply checkpointing periodically. Query evaluation results are not backed up; we simply re-run the queries, since the evaluation cost for a query is low.

6 EXPERIMENTAL EVALUATION

We evaluated LoSHa by comparing with specific LSH implementations on existing general-purpose platforms and with a highly optimized system [30] for SimHash. We also assessed the effectiveness of various optimizations in LoSHa and tested its scalability. LoSHa was implemented in C++ and will be made open source at http://www.husky-project.com/.

Setup. We ran the experiments on a cluster of 20 machines connected via a Broadcom BCM5720 Gigabit Ethernet controller. Each machine has two 6-core 2.0GHz Intel Xeon E5-2620 processors, 48GB DDR3-1333 RAM, and a 600GB SATA disk (6Gb/s, 10k rpm, 64MB cache). CentOS 6.5 with the 2.6.32 Linux kernel and Java 1.8.0 are installed on each machine. We used Hadoop 2.6.0 and Spark 1.6.1. For Hadoop, we set the number of containers to 11 per machine (i.e., one for each core), each allocated 4GB RAM, and left 1 core and 4GB RAM for other operations. For Spark, we set 4 workers per machine, and each worker is assigned 3 cores and 11GB RAM. We ran Spark in its standalone mode. For LoSHa, we used 12 cores per machine.

Datasets. We used three real datasets. For dense vectors, we used the largest publicly available dataset for ANN search, denoted by SIFT1B, which contains 1 billion SIFT image descriptors from http://corpus-texmex.irisa.fr/. We also used another dataset, denoted by Glove2.2M, which contains about 2.2 million vector representations for words, obtained from http://nlp.stanford.edu/projects/glove/ (Common Crawl: 840B tokens, 2.2M vocab, cased, 300d vectors). For sparse vectors, we crawled 1.05 billion tweets from Twitter, denoted by Tweets1B. Table 1 lists the number of items (Item#), the number of dimensions of each item vector (Dim#), and the number of non-empty entries in each vector (Entry#).

For each dataset, we randomly selected 1,000 items as queriesand removed these items from the datasets. Each query finds thetop 10 ANNs of the query (unless otherwise specified).


Figure 3: Overall running time (seconds) vs. error ratio. (a) SimHash on Glove2.2M: Spark SimHash vs. LoSHa SimHash. (b) E2LSH on SIFT0.1B: Hadoop E2LSH vs. LoSHa E2LSH.

Figure 4: Network traffic (GB) vs. number of hash tables. (a) SimHash on Glove2.2M: Spark SimHash vs. LoSHa SimHash. (b) E2LSH on SIFT0.1B: Hadoop E2LSH vs. LoSHa E2LSH.

Quality measure. Given a query set Q, let o_1, o_2, ..., o_K be the top K ANNs returned by an LSH algorithm and o*_1, o*_2, ..., o*_K be the exact top K NNs. Let d(o, q) denote the distance between o and q. The error ratio,

ER = (1 / (|Q| K)) * Σ_{q ∈ Q} Σ_{i=1}^{K} d(o_i, q) / d(o*_i, q),

is widely used in existing work [7, 9, 20, 29, 31, 36] to measure result quality. A smaller error ratio means better quality of results, and an error ratio of 1 means that the exact results are found. For single-probing LSH algorithms, however, there is no guarantee that K answers will be found. Thus, for those algorithms, we only used the number of answers actually returned to calculate the error ratio, and removed a query from Q if no answer was returned.
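The per-query part of the formula is a straightforward ratio sum; a minimal transcription (illustrative helper, assuming the returned and exact distances are already available and equally long):

#include <vector>

// Error ratio contribution of one query: average of d(o_i, q) / d(o*_i, q)
// over its top-K results. Averaging these values over all queries in Q
// gives the overall ER.
double errorRatioForQuery(const std::vector<double>& approxDist,   // d(o_i, q)
                          const std::vector<double>& exactDist) {  // d(o*_i, q)
    double sum = 0;
    for (size_t i = 0; i < approxDist.size(); ++i)
        sum += approxDist[i] / exactDist[i];
    return sum / approxDist.size();
}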

6.1 Comparison with LSH on Spark/Hadoop

We first compared LoSHa with specific LSH implementations on Spark and Hadoop. On Spark, SoundCloud implemented SimHash [3], which is used in production [28] and is the most popular LSH implementation on Spark. We denote it by Spark SimHash and compared it with our implementation of SimHash on LoSHa, denoted by LoSHa SimHash. On Hadoop, we used the Hadoop E2LSH implementation provided by the authors of [1] and compared it with our implementation of E2LSH [5] on LoSHa, denoted by LoSHa E2LSH. Spark SimHash and Hadoop E2LSH have more than 800 and 1,500 lines of Scala/Java code, respectively, while both LoSHa SimHash and LoSHa E2LSH have about 100 lines of C++ code.

Since both Spark SimHash and Hadoop E2LSH are slow and require a lot of computing resources to process large datasets, we could only test them on smaller datasets. SimHash is designed for angular distance and is used for querying text data, and thus we used the Glove2.2M dataset. We set k = 20 as normally used by Spark SimHash, and varied L from 10 to 100. E2LSH is for Euclidean space and thus we used it to query the SIFT image descriptor dataset. However, SIFT1B is too large for Hadoop E2LSH and thus we only used 0.1 billion items, denoted by SIFT0.1B. We set k = 20 and varied L from 1 to 15 (note that the running time of Hadoop E2LSH increases dramatically for larger L).

We report the overall running time of the four implementations in Figure 3. As we increase L, i.e., the number of hash tables, the running time also increases but the error ratio decreases. However, the running time of LoSHa increases much more slowly than that of Spark SimHash and Hadoop E2LSH. LoSHa SimHash attains an error ratio of 1.11 in 22.6 seconds, while Spark SimHash requires 411.6 seconds. LoSHa E2LSH attains an error ratio of 1.26 in 145.3 seconds, while Hadoop E2LSH uses 2,005.9 seconds.

The superior performance of LoSHa over the Spark/Hadoop implementations is mainly brought by our LSH-specific system implementation, such as the fine-grained data access and fast joins enabled by the data layout. In contrast, the Spark/Hadoop implementations follow the simple logic of joining queries and items by signatures (as discussed in Section 2.2.2), which incurs high network communication cost. As shown in Figure 4, LoSHa has significantly less network traffic (i.e., the total volume of messages sent over the network) than Spark/Hadoop, which verifies the effectiveness of our LSH-specific system implementation. Note that the Spark or Hadoop implementations could be made more efficient, but this would require non-trivial programming effort from users. A better way could be to implement LoSHa upon Spark and Hadoop, so that users need not worry about LSH-specific optimization details. Implementing the LoSHa framework upon Spark/Hadoop, however, is not the objective of this work, and we leave it as future work.

In Section 6.3, we will also compare LoSHa with the C++ MapReduce implementation as we examine in detail the effectiveness of the optimization techniques in LoSHa.

6.2 Comparison with PLSH

This experiment compared LoSHa with PLSH [30], a highly optimized distributed implementation of SimHash [3]. Unfortunately, we could not obtain the code of PLSH to repeat their experiments, so we could only try to mimic their experiments on LoSHa. We summarize our results in Figure 5.

PLSH achieved a recall of 92% by building 780 hash tables for processing 1 billion tweets, which requires 6.4TB RAM on 100 machines. If we use SimHash, we can generate at most 55 hash tables with the total 0.96TB of memory in our cluster, giving a poor recall of only 27.4%. Thus, we implemented the multi-probing version of SimHash (denoted as multi-probing SimHash) using the "candidate expansion using similar items" strategy of Section 3.3, which improves the recall from 27.4% to 86.8% with only 55 hash tables. We further measured the average querying time of multi-probing SimHash, which is 1.86ms (vs. 1.42ms for PLSH). The result is significant since LoSHa can process data of the same scale and achieve performance not much worse than PLSH, which uses significantly more computing resources (i.e., RAM and CPU).

We emphasize that PLSH is a specific system highly optimized for SimHash only, while LoSHa is a general framework for implementing different types of LSH algorithms, and thus it is expected that LoSHa has poorer performance than PLSH for SimHash. In fact, it is the multi-probing strategy that helps LoSHa achieve high performance using much fewer resources than PLSH. Therefore, it is necessary for a system to be user-friendly for implementing multi-probing algorithms. Note that our implementation of multi-probing SimHash has only about 100 lines of effective code, but it is definitely non-trivial to implement PLSH [30]. Thus, we believe that


Figure 5: Performance of multi-probing LoSHa. (a) Recall of SimHash vs. multi-probing SimHash. (b) Average querying time (milliseconds) of PLSH vs. LoSHa.

Table 2: Effectiveness of optimizations in LoSHa

                       Glove2.2M                        Tweets0.21B
                       Querying time  Network traffic   Querying time  Network traffic
LoSHa                  0.52s          0.140GB           30.21s         20.13GB
No de-duplication      0.48s          0.143GB           82.38s         74.98GB
No message reduction   3.09s          4.95GB            282.29s        526.95GB
MapReduce LoSHa        187.61s        355.34GB          failed         failed

Thus, we believe that the abstraction, API, and LSH-specific implementation of LoSHa can help promote multi-probing LSH on existing distributed platforms.

6.3 Effects of Optimization Techniques
We now examine the effectiveness of the various optimizations implemented in LoSHa. To do that, we created four versions of LoSHa: (1) LoSHa with all optimizations enabled, (2) LoSHa without de-duplication, (3) LoSHa with the message reduction techniques (i.e., query broadcasting and various combiners) disabled, and (4) LoSHa with MapReduce (in C++) in place of the two-level join.

We first tested the performance of the four versions of LoSHa running SimHash on Glove2.2M, with k = 20 and L = 100. We also tested on a larger dataset, Tweets0.21B, by processing 0.21 billion tweets from Tweets1B, i.e., each machine processes 10.5 million tweets. As Tweets0.21B is too large for single-probing SimHash to work, we used multi-probing SimHash, with k = 16 and L = 55. We report the time and network traffic for processing 1,000 queries in Table 2. We discuss the effects of the different optimizations as follows.

Effect of de-duplication. Disabling de-duplication actually reduces the querying time of LoSHa on Glove2.2M from 0.52s to 0.48s. This is because the word vectors in Glove2.2M are relatively different from each other and thus the amount of duplicate processing is small, as verified by the similar volumes of network traffic measured for LoSHa with and without de-duplication. Note that duplicate copies may also already be removed by the two-level combiner in LoSHa (see Sections 5.4 and 5.5). Moreover, de-duplication itself incurs overhead, so the overall time can even decrease when it is disabled. For the tweets dataset, however, disabling de-duplication significantly degrades performance and query processing becomes almost 3 times slower.

Figure 6: Average query time (in milliseconds) vs. number of machines. (a) M-SimHash on Tweets1B; (b) M-E2LSH on SIFT1B.

This is because many tweets are highly similar and thus the amount of duplicate processing is much larger. In addition, we use multi-probing for Tweets0.21B, and hence duplicate candidates make queries probe more buckets in subsequent iterations. Thus, the volume of network traffic also increases significantly, from 20.13GB to 74.98GB, when de-duplication is disabled in LoSHa.
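The kind of de-duplication discussed here can be illustrated with a minimal sketch (the data structures below are hypothetical, not LoSHa's actual internals): each query tracks which item ids it has already been paired with, so an item that appears in several probed buckets is evaluated only once and does not trigger extra probing in later rounds.

#include <unordered_set>
#include <vector>

// Minimal, hypothetical sketch of per-query de-duplication.
struct QueryState {
    std::unordered_set<long> evaluated;   // item ids already paired with this query
    std::vector<long> fresh;              // new candidates produced in this round
};

void AddCandidates(QueryState& q, const std::vector<long>& bucket) {
    for (long item_id : bucket)
        if (q.evaluated.insert(item_id).second)   // true only on first insertion
            q.fresh.push_back(item_id);
}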

Effect of message reduction. The importance of the message reduction techniques is obvious, as disabling them immediately increases the network traffic by 35 times for Glove2.2M and 26 times for Tweets0.21B. As a result, communication cost dominates the total querying cost and thus the overall query performance is also severely degraded.
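A machine-level combiner of the kind referred to here can be sketched as follows (illustrative code only; the actual combiners are described in Sections 5.4 and 5.5): messages addressed to the same destination machine are buffered and merged locally, so each (query, item) pair crosses the network at most once per destination.

#include <cstdint>
#include <map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical sketch of a machine-level combiner.
struct Combiner {
    // For each destination machine: the set of (query id, item id) pairs to send.
    std::map<int, std::unordered_set<std::uint64_t>> outbox;

    static std::uint64_t Key(std::uint32_t query_id, std::uint32_t item_id) {
        return (static_cast<std::uint64_t>(query_id) << 32) | item_id;
    }

    void Add(int dest_machine, std::uint32_t query_id, std::uint32_t item_id) {
        outbox[dest_machine].insert(Key(query_id, item_id));  // duplicates collapse here
    }

    // Flush returns, per destination, the de-duplicated messages to transmit.
    std::vector<std::pair<int, std::vector<std::uint64_t>>> Flush() {
        std::vector<std::pair<int, std::vector<std::uint64_t>>> batches;
        for (auto& [dest, msgs] : outbox)
            batches.emplace_back(dest, std::vector<std::uint64_t>(msgs.begin(), msgs.end()));
        outbox.clear();
        return batches;
    }
};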

Effect of two-level join. LoSHa's data layout, which enables the two-level join for query processing, is the most critical optimization. When we replace the two-level join with a MapReduce procedure, a large number of messages are transmitted over the network and the processing time is also severely increased, even though both the machine-level combiner and de-duplication are already enabled. For Tweets0.21B, the system ran out of the (48 x 20)GB of aggregate memory in the cluster. Note that the MapReduce procedure is essentially a hash join between queries and items on their signatures, but it generates many more messages than the two-level join employed with LoSHa's data layout.
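The actual two-level join and data layout are defined in Section 5; the following is only a rough sketch of the idea as we read it from the description above, with hypothetical names (LocalTables, LocalJoin): each machine keeps hash tables over the items it owns, a query is shipped once to each machine (first level), and the signature join is then performed locally against those tables (second level), instead of shuffling one message per (table, signature) pair.

#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Signature = std::uint64_t;

struct LocalTables {
    // One bucket map per hash table, holding only the items stored on this machine.
    std::vector<std::unordered_map<Signature, std::vector<long>>> tables;
};

// Second level: executed locally on every machine that received the query.
// Only the locally found candidates need to travel back over the network.
std::vector<long> LocalJoin(const LocalTables& local,
                            const std::vector<Signature>& query_sigs) {
    std::unordered_set<long> candidates;
    for (std::size_t t = 0; t < local.tables.size(); ++t) {
        auto it = local.tables[t].find(query_sigs[t]);
        if (it == local.tables[t].end()) continue;
        candidates.insert(it->second.begin(), it->second.end());
    }
    return std::vector<long>(candidates.begin(), candidates.end());
}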

6.4 Scalability
We tested the scalability of LoSHa with the two billion-scale datasets, Tweets1B and SIFT1B. We varied the number of machines from 5 to 20, while fixing the number of data items per machine at 50 million.

For querying the tweets, we ran multi-probing SimHash (denoted by M-SimHash), with k = 16 and L = 55, as in Section 6.2. We report the average querying time for processing 1,000 queries in Figure 6a, which shows a gentle slowdown in query processing as the number of machines increases. Since the number of data items grows at the same rate as the number of machines, the small increase in querying time is reasonable, as more data are transmitted through the network and more coordination among workers is needed. We observed no obvious stragglers during the whole process, so the slowdown is more likely caused by communication, which a faster network (e.g., InfiniBand instead of Gigabit Ethernet) would help reduce.

We ran LoSHa E2LSH to query the SIFT data and used k = 20, as in Section 6.1. However, we could use at most 3 hash tables because the number of items increases to 50 million per machine (note that, unlike a tweet, a SIFT item is a dense vector).



The error ratio is 1.72. Thus, we implemented multi-probing E2LSH on LoSHa, which reduces the error ratio to 1.23. There is also a small increase in querying time when we increase the number of machines from 5 to 20, as shown in Figure 6b, for the same reason as explained for Figure 6a.
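As a reminder (this is the standard p-stable hash family from [5], not a LoSHa-specific definition), E2LSH hashes a vector with random projections:
\[
  h_{\mathbf{a},b}(\mathbf{v}) \;=\; \left\lfloor \frac{\mathbf{a}\cdot\mathbf{v} + b}{w} \right\rfloor,
\]
where \(\mathbf{a}\) has i.i.d. Gaussian entries, \(b\) is drawn uniformly from \([0, w)\), and \(w\) is the bucket width; each table concatenates k such functions and L tables are built. Since every SIFT item is a dense vector, maintaining many tables over 50 million items per machine is memory-intensive, which is why only 3 tables fit and multi-probing is needed to recover accuracy.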

7 RELATED WORK
LSH [13] has been extensively studied and various hash function families for different metrics have been developed, such as lp distance [5], angular distance [3], and Jaccard distance [2]. Since single-probing LSH [1, 18, 21, 22, 27, 28] requires a large number of hash tables and cannot scale to handle large datasets, many multi-probing LSH algorithms have been proposed to use fewer hash tables and yet achieve comparable result quality [7, 12, 20, 24, 29, 31, 36]. However, they are single-core algorithms and many still rely on external memory to handle large datasets.

Recently, more distributed LSH algorithms have been developed to handle larger datasets. PLSH [30] is a highly optimized implementation of distributed SimHash [3], but the development cost of such a system, tailor-made for one specific LSH algorithm, is too high for general applications. Most existing distributed LSH algorithms were implemented on general-purpose platforms such as Hadoop [1, 18, 21], Spark [22, 28], and distributed hash tables [11]. However, these general-purpose platforms are not designed for solving domain-specific problems and thus the LSH implementations on these platforms are also inefficient.

For solving problems in a particular domain, a number of domain-specific systems, such as Parameter Server [17], PowerGraph [10], and LFTF [35], have been developed, with which one can easily implement different types of distributed machine learning and graph analytics algorithms in one framework. However, we are not aware of any prior work on a general framework that allows efficient and scalable implementation of all kinds of LSH algorithms.

8 CONCLUSIONS
We presented LoSHa, which offers a general, unified programming framework and a user-friendly API for easy implementation of a wide range of LSH algorithms. We explored various LSH-specific system implementation and optimization techniques to enable scalable LSH computation. Our experiments on billion-scale datasets verified the efficiency, scalability, and result quality of LoSHa's implementations of a number of popular LSH algorithms. We believe that LoSHa can benefit many applications of LSH as it significantly lowers development cost while achieving high performance. Such a general framework for LSH can also lead to the development of more efficient distributed algorithms for new LSH applications.

Ack. This work was partially supported by Grants (CUHK 14206715 & 14222816) from the Hong Kong RGC, ITF 6904079, and Grant 3132821 funded by the Research Committee of CUHK.

REFERENCES
[1] B. Bahmani, A. Goel, and R. Shinde. Efficient distributed locality sensitive hashing. In CIKM, pages 2174–2178, 2012.
[2] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC, pages 327–336, 1998.
[3] M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, 2002.
[4] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271–280, 2007.
[5] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253–262, 2004.
[6] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.
[7] J. Gan, J. Feng, Q. Fang, and W. Ng. Locality-sensitive hashing scheme based on dynamic collision counting. In SIGMOD, pages 541–552, 2012.
[8] J. Gao, H. V. Jagadish, W. Lu, and B. C. Ooi. DSH: data sensitive hashing for high-dimensional k-NN search. In SIGMOD, pages 1127–1138, 2014.
[9] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518–529, 1999.
[10] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, pages 17–30, 2012.
[11] P. Haghani, S. Michel, and K. Aberer. Distributed similarity search in high dimensions using locality sensitive hashing. In EDBT, pages 744–755, 2009.
[12] Q. Huang, J. Feng, Y. Zhang, Q. Fang, and W. Ng. Query-aware locality-sensitive hashing for approximate nearest neighbor search. In PVLDB, volume 9, pages 1–12, 2015.
[13] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, 1998.
[14] Learning to Hash. http://cs.nju.edu.cn/lwj/l2h.html. 2017.
[15] S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. A. Gibson, and E. P. Xing. On model parallelization and scheduling strategies for distributed machine learning. In NIPS, pages 2834–2842, 2014.
[16] J. Li, J. Cheng, Y. Zhao, F. Yang, Y. Huang, H. Chen, and R. Zhao. A comparison of general-purpose distributed systems for data processing. In IEEE BigData, pages 378–383, 2016.
[17] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su. Scaling distributed machine learning with the parameter server. In OSDI, pages 583–598, 2014.
[18] LikeLike. https://github.com/takahi-i/likelike. 2017.
[19] W. Liu, J. Wang, S. Kumar, and S. Chang. Hashing with graphs. In ICML, pages 1–8, 2011.
[20] Y. Liu, J. Cui, Z. Huang, H. Li, and H. T. Shen. SK-LSH: an efficient index structure for approximate nearest neighbor search. In PVLDB, volume 7, pages 745–756, 2014.
[21] LSH-Hadoop. https://github.com/lancenorskog/lsh-hadoop. 2017.
[22] LSH-Spark. https://github.com/marufaytekin/lsh-spark. 2017.
[23] Y. Lu, J. Cheng, D. Yan, and H. Wu. Large-scale distributed graph computing systems: An experimental evaluation. In PVLDB, volume 8, pages 281–292, 2014.
[24] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In VLDB, pages 950–961, 2007.
[25] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, pages 135–146, 2010.
[26] L. Paulevé, H. Jégou, and L. Amsaleg. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. In Pattern Recognition Letters, volume 31, pages 1348–1358, 2010.
[27] A. Rajaraman and J. D. Ullman. Mining of massive datasets, volume 1. 2012.
[28] SoundCloud-LSH. https://github.com/soundcloud/cosine-lsh-join-spark. 2017.
[29] Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin. SRS: solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index. In PVLDB, volume 8, pages 1–12, 2014.
[30] N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk, S. Madden, and P. Dubey. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. In PVLDB, volume 6, pages 1930–1941, 2013.
[31] Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Quality and efficiency in high dimensional nearest neighbor search. In SIGMOD, pages 563–576, 2009.
[32] J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity search: A survey. In CoRR, volume abs/1408.2927, 2014.
[33] F. Yang, Y. Huang, Y. Zhao, J. Li, G. Jiang, and J. Cheng. The best of both worlds: Big data programming with both productivity and performance. In SIGMOD, pages 1619–1622, 2017.
[34] F. Yang, J. Li, and J. Cheng. Husky: Towards a more efficient and expressive distributed computing framework. In PVLDB, volume 9, pages 420–431, 2016.
[35] F. Yang, F. Shang, Y. Huang, J. Cheng, J. Li, Y. Zhao, and R. Zhao. LFTF: A framework for efficient tensor analytics at scale. In PVLDB, volume 10, pages 745–756, 2017.
[36] Y. Zheng, Q. Guo, A. K. Tung, and S. Wu. LazyLSH: Approximate nearest neighbor search for multiple distance functions with a single index. In SIGMOD, 2016.
