Lucene at Yelp - by Sudarshan Gaikaiwari

Post on 14-Dec-2014

Category: Technology

DESCRIPTION

See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

TRANSCRIPT

Lucene @ Yelp

Sudarshan Gaikaiwari

Bio

1. Over a decade of experience in information retrieval
2. Used IR techniques at Symantec's DLP group
3. Search Engineer at Yelp

Outline

1. Overview of search services at Yelp
2. Federation Motivation
3. Lucy Indexing
4. Lucy Searching
5. Efficiently Retrieving top k hits

The services we provide

Lucy: business search

Lucy also powers phone search

Cathy: she 'talks' a lot

Listsearch: it searches lists....

Reviewsearch: it searches reviews....

DYM: did you really mean that?

Suggest: auto completion

Federation Motivation

Problem

Search is too slow

Hard Disk Seek Latency

Disk seek: 10,000,000 ns

Source: Software Engineering Advice from Building Large-Scale Distributed Systems, Jeffrey Dean

RAM read latency

Main memory reference: 100 ns

Pinning Index in RAM

● vmtouch
● mlock
● http://hoytech.com/vmtouch/

Problem

Index is too large to fit in memory on a single machine

Geographical sharding

Geographical Sharding drawbacks

1. Cumbersome manual process to determine shard boundary
2. No guarantee that a boundary can be found

Federation

1. Split index across multiple machines
2. Shard on business id
3. TF-IDF scores from different machines should be comparable

Mapping businesses to shards

1. Assigning businesses to shards

shard = shardlist[hash(business_id) % len(shardlist)]

Problems

1. Involves re-indexing all the businesses if we want to add a new shard
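
The re-indexing cost of modulo assignment can be illustrated with a minimal sketch. The business ids, shard names, and the use of MD5 as a stable hash are assumptions made for the example, not details from the talk.

```python
import hashlib

def stable_hash(business_id: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process for strings)."""
    return int(hashlib.md5(business_id.encode("utf-8")).hexdigest(), 16)

def shard_for(business_id: str, shardlist: list) -> str:
    """Modulo assignment, as in the formula above."""
    return shardlist[stable_hash(business_id) % len(shardlist)]

def fraction_moved(business_ids, old_shards, new_shards) -> float:
    """Fraction of businesses whose shard changes when the shard list changes."""
    moved = sum(
        1 for b in business_ids
        if shard_for(b, old_shards) != shard_for(b, new_shards)
    )
    return moved / len(business_ids)

# Hypothetical data: 10,000 businesses, growing from 4 to 5 shards.
business_ids = [f"biz-{i}" for i in range(10_000)]
old_shards = ["shard0", "shard1", "shard2", "shard3"]
new_shards = old_shards + ["shard4"]

print(f"{fraction_moved(business_ids, old_shards, new_shards):.0%} of businesses change shard")
```

Going from 4 to 5 shards, a business keeps its shard only when its hash gives the same remainder modulo 4 and modulo 5, which holds for about one hash in five, so roughly 80% of the businesses would have to be re-indexed elsewhere.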

Virtual Nodes

Advantages

1. Flexibility (move vbuckets from one shard to another)
2. Split hot spot shards
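
One way to picture virtual nodes is a fixed set of vbuckets with a lookup table from vbucket to shard. The sketch below assumes 1024 vbuckets, MD5 hashing, round-robin initial assignment, and hypothetical shard names; the talk does not specify any of these.

```python
import hashlib

NUM_VBUCKETS = 1024  # assumed value; it stays fixed when shards are added

def vbucket_for(business_id: str) -> int:
    """A business always hashes to the same vbucket, whatever the shard count."""
    digest = hashlib.md5(business_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_VBUCKETS

def build_table(shards):
    """Initial vbucket -> shard table, assigned round-robin."""
    return {vb: shards[vb % len(shards)] for vb in range(NUM_VBUCKETS)}

def shard_for(business_id, table):
    return table[vbucket_for(business_id)]

def add_shard(table, new_shard):
    """Hand an even share of vbuckets to new_shard; all other vbuckets stay put."""
    shards = sorted(set(table.values()))
    target = NUM_VBUCKETS // (len(shards) + 1)
    owned = {s: [vb for vb, owner in table.items() if owner == s] for s in shards}
    new_table = dict(table)
    moved = []
    i = 0
    while len(moved) < target:
        donor = shards[i % len(shards)]  # take from existing shards in turn
        vb = owned[donor].pop()
        new_table[vb] = new_shard
        moved.append(vb)
        i += 1
    return new_table, moved

table = build_table(["shard0", "shard1", "shard2", "shard3"])
new_table, moved = add_shard(table, "shard4")
print(len(moved), "of", NUM_VBUCKETS, "vbuckets move")  # 204 of 1024 vbuckets move
```

Only businesses in the moved vbuckets need to be indexed on the new shard. The same table edit could move vbuckets off a hot shard.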

Lucy Master Slave Architecture

Separate indexing (masters)

A master for each shard of a service

Searching (slaves)

A slave for every replica of a service

Lucy Indexing

Lucy Searching

Federator: Combining results across shards

1. Once we distribute an index across shards we need a component which will search all these shards and combine their results.
2. Written in Python (runs inside a Python web process).
3. Uses the Tornado IO loop to send requests to all shards.
4. The transfer protocol for the requests is JSON-RPC.
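
The scatter-gather pattern described above can be sketched with the standard library. This example swaps in `asyncio` for the Tornado IO loop and replaces the JSON-RPC calls with an in-memory stand-in; `FAKE_SHARDS`, `search_shard`, and `federated_search` are hypothetical names, and the scores are invented.

```python
import asyncio
import heapq

# Hypothetical stand-ins for shard servers: (score, business_id) pairs, best first.
FAKE_SHARDS = {
    "shard0": [(9.1, "b17"), (4.2, "b03")],
    "shard1": [(8.7, "b22"), (6.5, "b40"), (1.0, "b08")],
    "shard2": [(7.3, "b51")],
}

async def search_shard(shard: str, query: str, num_hits: int):
    """Stand-in for a JSON-RPC request to one shard."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return FAKE_SHARDS[shard][:num_hits]

async def federated_search(query: str, num_hits: int):
    # Send the query to every shard concurrently and wait for all replies.
    per_shard = await asyncio.gather(
        *(search_shard(shard, query, num_hits) for shard in FAKE_SHARDS)
    )
    # Merge the already-sorted lists; this assumes scores are comparable across shards.
    merged = heapq.merge(*per_shard, key=lambda hit: hit[0], reverse=True)
    return list(merged)[:num_hits]

top3 = asyncio.run(federated_search("pizza", 3))
print(top3)  # [(9.1, 'b17'), (8.7, 'b22'), (7.3, 'b51')]
```

The merge step is only meaningful if every shard scores documents on the same scale, which is why comparable TF-IDF scores are listed as a requirement for federation.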

Lucy Server

Tokens to Business Attributes

Executing queries

1. Gather the top results for a query
2. Collect attribute statistics for attributes like places, categories

Lucene

1. Efficiently executes queries over the index
2. Scores how relevant a business is to the words in the query (word score)
3. Upgrading Lucene to 2.9/3.1 is a work in progress

Successive geobounds relaxation

Federation

Efficiently Retrieving top k hits

1. As a user pages through the results, the number of hits to be retrieved increases:

num hits = start + count

2. So if we need to retrieve 500 hits, the naive way would be to retrieve 500 hits from each shard and then sort them.
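
A minimal sketch of the naive approach, assuming each shard returns `(score, id)` pairs sorted best first. The five shards, their sizes, and the random scores are invented for illustration.

```python
import heapq
import random

def naive_page(shards, start, count):
    """Fetch start + count hits from every shard, merge them, return one page."""
    num_hits = start + count
    fetched = [shard[:num_hits] for shard in shards]
    transferred = sum(len(hits) for hits in fetched)
    merged = heapq.nlargest(num_hits, (hit for hits in fetched for hit in hits))
    return merged[start:start + count], transferred

# Hypothetical index: 5 shards with 1000 scored businesses each.
rng = random.Random(0)
shards = [
    sorted(((rng.random(), f"s{s}-b{i}") for i in range(1000)), reverse=True)
    for s in range(5)
]

page, transferred = naive_page(shards, start=490, count=10)
print(len(page), transferred)  # 10 2500
```

Showing 10 hits on the 50th page moves 2500 hits over the network here, which is the cost the rest of this section tries to reduce.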

Distribution of hits in shards

Probability a hit is in a shard

Binomial Distribution

Probability that r of the top k hits are in a particular shard

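Here k is the number of top hits and p is the probability that a hit is in a particular shard.

Mean

$\mu = kp$

Variance

$\sigma^2 = kp(1 - p)$

Std Deviation

$\sigma = \sqrt{kp(1 - p)}$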

Simulation

k = 100, p = 0.2

Hits selected from each shard    Results Missed (%)
24                               0.017
32                               0.0001407
44                               0.00000
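
The binomial model can be checked numerically with a short sketch. For k = 100 and p = 0.2 the mean is 20 and the standard deviation is 4, so 24, 32 and 44 coincide with the mean plus 1, 3 and 6 standard deviations; reading the cut-offs that way is an inference, not something the transcript states. The function names are hypothetical, and the tail probability computed here is a related quantity, not necessarily the "Results Missed (%)" metric from the simulation.

```python
from math import comb, sqrt

def binomial_stats(k: int, p: float):
    """Mean and standard deviation of Binomial(k, p)."""
    return k * p, sqrt(k * p * (1 - p))

def prob_more_than(n: int, k: int, p: float) -> float:
    """P(X > n) for X ~ Binomial(k, p): a shard holds more than n of the top k hits."""
    return sum(comb(k, r) * p**r * (1 - p) ** (k - r) for r in range(n + 1, k + 1))

k, p = 100, 0.2
mean, std = binomial_stats(k, p)
for n in (24, 32, 44):
    sigmas = (n - mean) / std
    print(f"n={n}: mean + {sigmas:.0f} std, P(X > n) = {prob_more_than(n, k, p):.3g}")
```

The tail probability falls off quickly as the cut-off moves away from the mean, which is why a per-shard request well below k can still almost always cover the true top k.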

Simulation Graph

Results

1. ~ 50% savings over 100 hits (44 hits requested from each shard)

2. 77% savings over 1000 hits (228 hits requested from each shard)
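
The savings figures can be reproduced with simple arithmetic, assuming "savings" means the reduction in hits transferred compared with requesting all k hits from every shard. Under that assumption the number of shards cancels out; the `savings` helper is a hypothetical name.

```python
def savings(k: int, per_shard: int) -> float:
    """Fraction of hits saved: naive fetches k per shard, reduced fetches per_shard."""
    return 1 - per_shard / k

print(f"{savings(100, 44):.0%}")    # 56%
print(f"{savings(1000, 228):.0%}")  # 77%
```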

Future work

1. In memory index
2. Move towards real time search

Come Join Us!

Thank You

smg@yelp.com
