Sudarshan Gaikaiwari - Lucene @ Yelp
DESCRIPTION
This talk describes how Yelp uses Lucene to provide search services. It includes:
* Statistics of Yelp search usage
* Overview of Yelp search architecture: Yelp uses different services to provide searches for different types of data. Some are based on Lucene and some on Solr.
* Deeper dive into business and review search, the most important search service at Yelp.
TRANSCRIPT
![Page 1: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/1.jpg)
Lucene @ Yelp
Sudarshan Gaikaiwari
![Page 2: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/2.jpg)
Bio
1. Over a decade of experience in information retrieval
2. Used IR techniques at Symantec's DLP group
3. Search Engineer at Yelp
![Page 3: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/3.jpg)
Outline
1. Overview of search services at Yelp
2. Federation Motivation
3. Lucy Indexing
4. Lucy Searching
5. Efficiently Retrieving top k hits
![Page 4: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/4.jpg)
The services we provide
![Page 5: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/5.jpg)
Lucy: business search
![Page 6: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/6.jpg)
Lucy also powers phone search
![Page 7: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/7.jpg)
Cathy: she 'talks' a lot
![Page 8: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/8.jpg)
Listsearch: it searches lists....
![Page 9: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/9.jpg)
Reviewsearch: it searches reviews....
![Page 10: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/10.jpg)
DYM: did you really mean that?
![Page 11: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/11.jpg)
Suggest: auto completion
![Page 12: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/12.jpg)
Federation Motivation
![Page 13: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/13.jpg)
Problem
Search is too slow
![Page 14: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/14.jpg)
Hard Disk Seek Latency
Disk seek: 10,000,000 ns
Source: "Software Engineering Advice from Building Large-Scale Distributed Systems", Jeffrey Dean
![Page 15: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/15.jpg)
RAM read latency
Main memory reference: 100 ns
![Page 17: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/17.jpg)
Problem
Index is too large to fit in memory on a single machine
![Page 18: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/18.jpg)
Geographical sharding
![Page 19: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/19.jpg)
Geographical Sharding drawbacks
1. Cumbersome manual process to determine shard boundary
2. No guarantee that a boundary can be found
![Page 20: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/20.jpg)
Federation
1. Split index across multiple machines
2. Shard on business id
3. TF-IDF scores from different machines should be comparable
![Page 21: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/21.jpg)
Mapping businesses to shards
1. Assigning businesses to shards
shard = shardlist[hash(business_id) % len(shardlist)]
Problems
1. Adding a new shard involves re-indexing all the businesses
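To make the re-indexing problem concrete, here is a minimal sketch that counts how many businesses land on a different shard when a fifth shard is appended to a four-shard list. The business ids, the shard names, and the use of MD5 as a stable hash are illustrative assumptions, not details from the talk.

```python
import hashlib


def stable_hash(business_id: str) -> int:
    """Deterministic hash; Python's built-in hash() is salted per process for strings."""
    return int(hashlib.md5(business_id.encode("utf-8")).hexdigest(), 16)


def shard_for(business_id: str, shardlist: list) -> str:
    # Same rule as on the slide, with a stable hash swapped in.
    return shardlist[stable_hash(business_id) % len(shardlist)]


# Hypothetical data: 10,000 made-up business ids and made-up shard names.
business_ids = [f"biz-{i}" for i in range(10_000)]
four_shards = ["shard0", "shard1", "shard2", "shard3"]
five_shards = four_shards + ["shard4"]

moved = sum(
    shard_for(b, four_shards) != shard_for(b, five_shards) for b in business_ids
)
fraction_moved = moved / len(business_ids)
print(f"{fraction_moved:.0%} of businesses change shard")
```

With a uniform hash, a business keeps its shard only when its hash gives the same remainder modulo 4 and modulo 5, which happens for roughly one in five ids, so most of the index has to move.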
![Page 22: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/22.jpg)
Virtual Nodes
![Page 23: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/23.jpg)
Advantages
1. Flexibility (move vbuckets from one shard to another)
2. Split hot spot shards
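The vbucket idea can be sketched as a two-step lookup: a business id hashes to one of a fixed number of virtual buckets, and a separate table maps each vbucket to a shard. The bucket count of 1024, the shard names, and the rebalancing policy below are assumptions chosen for illustration; the talk does not specify them.

```python
import hashlib

NUM_VBUCKETS = 1024  # assumed; fixed for the lifetime of the index


def vbucket_for(business_id: str) -> int:
    """Map a business id to a virtual bucket; this mapping never changes."""
    digest = hashlib.md5(business_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_VBUCKETS


def shard_for(business_id: str, table: dict) -> str:
    """Look up the shard that currently owns the business's vbucket."""
    return table[vbucket_for(business_id)]


# Initial assignment: vbuckets spread over four hypothetical shards.
vbucket_to_shard = {vb: f"shard{vb % 4}" for vb in range(NUM_VBUCKETS)}

business_ids = [f"biz-{i}" for i in range(10_000)]
before = {b: shard_for(b, vbucket_to_shard) for b in business_ids}

# Add a fifth shard by reassigning every fifth vbucket to it.
rebalanced = dict(vbucket_to_shard)
for vb in range(0, NUM_VBUCKETS, 5):
    rebalanced[vb] = "shard4"

moved = [b for b in business_ids if shard_for(b, rebalanced) != before[b]]
fraction_moved = len(moved) / len(business_ids)
print(f"{fraction_moved:.0%} of businesses move, all of them to shard4")
```

Only the businesses in the reassigned vbuckets move, and splitting a hot shard amounts to handing some of its vbuckets to another shard in the same table.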
![Page 24: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/24.jpg)
Lucy Master Slave Architecture
Separate indexing (masters)
A master for each shard of a service
Searching (slaves)
A slave for every replica of a service
![Page 25: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/25.jpg)
Lucy Indexing
![Page 26: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/26.jpg)
![Page 27: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/27.jpg)
![Page 28: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/28.jpg)
![Page 29: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/29.jpg)
![Page 30: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/30.jpg)
Lucy Searching
![Page 31: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/31.jpg)
![Page 32: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/32.jpg)
Federator: Combining results across shards
1. Once we distribute an index across shards, we need a component that searches all these shards and combines their results.
2. Written in Python (runs inside a Python web process).
3. Uses the Tornado IO loop to send requests to all shards.
4. The transfer protocol for the requests is JSON-RPC.
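The fan-out and merge described above might look roughly like the sketch below. To stay self-contained it uses the standard library's `asyncio` in place of the Tornado IO loop, and an in-memory dictionary in place of JSON-RPC calls to shard servers; `SHARDS`, `search_shard`, and `federated_search` are hypothetical names, and the scores are made up. Merging by raw score assumes that scores from different shards are comparable.

```python
import asyncio
import heapq

# Hypothetical stand-ins for shard servers: each holds (score, business_id) hits.
SHARDS = {
    "shard0": [(0.91, "biz-a"), (0.40, "biz-d")],
    "shard1": [(0.87, "biz-b"), (0.15, "biz-e")],
    "shard2": [(0.66, "biz-c")],
}


async def search_shard(shard: str, query: str, num_hits: int) -> list:
    """Pretend RPC; a real federator would send a JSON-RPC request here."""
    await asyncio.sleep(0)  # yield to the event loop, as network IO would
    return sorted(SHARDS[shard], reverse=True)[:num_hits]


async def federated_search(query: str, num_hits: int) -> list:
    # Send the query to every shard concurrently and wait for all replies.
    per_shard = await asyncio.gather(
        *(search_shard(shard, query, num_hits) for shard in SHARDS)
    )
    # Combine the per-shard lists and keep the best num_hits overall.
    all_hits = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(num_hits, all_hits)


top = asyncio.run(federated_search("pizza", 3))
print(top)
```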
![Page 33: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/33.jpg)
Lucy Server
![Page 34: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/34.jpg)
![Page 35: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/35.jpg)
![Page 36: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/36.jpg)
Tokens to Business Attributes
![Page 37: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/37.jpg)
Executing queries
1. Gather the top results for a query
2. Collect attribute statistics for attributes like places, categories
![Page 38: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/38.jpg)
Lucene
1. Efficiently executes queries over the index
2. Scores how relevant the business is to the words in the query (word score)
3. Upgrading Lucene to 2.9/3.1 is WIP
![Page 39: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/39.jpg)
![Page 40: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/40.jpg)
Successive geobounds relaxation
![Page 41: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/41.jpg)
Successive geobounds relaxation
![Page 42: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/42.jpg)
Federation
![Page 43: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/43.jpg)
Efficiently Retrieving top k hits
1. When a user moves through multiple pages, the number of hits to be returned increases:
num hits = start + count
2. So if we need to retrieve 500 hits, the naive way would be to retrieve 500 hits from each shard and then sort them.
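A minimal sketch of the naive approach, assuming each shard returns its hits sorted best-first as `(score, id)` pairs. The five shards, their synthetic scores, and the function name `naive_page` are invented for illustration.

```python
import heapq


def naive_page(shard_results: list, start: int, count: int) -> tuple:
    """Fetch start + count hits from every shard, merge, then slice out the page."""
    num_hits = start + count
    candidates = []
    for hits in shard_results:
        candidates.extend(hits[:num_hits])  # each shard sends its top num_hits
    merged = heapq.nlargest(num_hits, candidates)
    return merged[start:start + count], len(candidates)


# Five hypothetical shards with 1000 hits each; scores are synthetic and distinct.
shard_results = [
    sorted(
        ((1.0 / (1 + shard + 5 * rank), f"s{shard}-r{rank}") for rank in range(1000)),
        reverse=True,
    )
    for shard in range(5)
]

page, transferred = naive_page(shard_results, start=490, count=10)
print(len(page), "hits shown;", transferred, "hits transferred from the shards")
```

For the page ending at hit 500, the federator pulls 500 hits from each of the five shards to display ten of them, which is the cost the rest of this section tries to reduce.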
![Page 44: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/44.jpg)
Distribution of hits in shards
![Page 45: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/45.jpg)
![Page 46: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/46.jpg)
Probability a hit is in a shard
![Page 47: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/47.jpg)
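Binomial Distribution
Probability that r of the top k hits are in a particular shard, where p is the probability that a hit is in that shard: $P(r) = \binom{k}{r} p^{r} (1 - p)^{k - r}$
Mean: $\mu = kp$
Variance: $\sigma^{2} = kp(1 - p)$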
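The binomial quantities named on this slide can be computed directly. The sketch below uses the standard binomial formulas with k = 100 and p = 0.2, the values that appear later on the simulation slide; the helper names `binomial_pmf` and `prob_more_than` are made up for this example.

```python
from math import comb, sqrt


def binomial_pmf(r: int, k: int, p: float) -> float:
    """Probability that exactly r of the top k hits fall in one shard."""
    return comb(k, r) * p**r * (1 - p) ** (k - r)


def prob_more_than(n: int, k: int, p: float) -> float:
    """Probability that a shard holds more than n of the top k hits."""
    return sum(binomial_pmf(r, k, p) for r in range(n + 1, k + 1))


k, p = 100, 0.2
mean = k * p
variance = k * p * (1 - p)
std_dev = sqrt(variance)
print(f"mean={mean:.1f} variance={variance:.1f} std_dev={std_dev:.1f}")

# Chance that asking one shard for only n hits is not enough for that shard.
for n in (24, 32, 44):
    print(n, prob_more_than(n, k, p))
```

Because most of the probability mass sits within a few standard deviations of the mean, asking each shard for somewhat more than k * p hits is usually enough to cover its share of the top k.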
![Page 48: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/48.jpg)
Formula
Std Deviation
Formula
![Page 49: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/49.jpg)
Simulation
k = 100, p = 0.2

| Formula | Hits selected from each shard | Missed (%) |
|---|---|---|
| | 24 | 0.017 |
| | 32 | 0.0001407 |
| | 44 | 0.00000 |
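A simulation of this kind could be sketched as follows. It rests on two assumptions not spelled out on the slide: p = 0.2 is taken to mean five equally likely shards, and "missed" is taken to mean the fraction of true top-k hits that are lost because a shard held more of them than were requested from it. The function name, trial count, and seed are arbitrary, so the printed values are not expected to reproduce the slide's figures exactly.

```python
import random


def simulate_missed_fraction(
    k: int, num_shards: int, per_shard: int, trials: int, seed: int = 0
) -> float:
    """Estimate the fraction of top-k hits missed when each shard returns per_shard hits."""
    rng = random.Random(seed)
    missed = 0
    for _ in range(trials):
        # Drop each of the k best hits into a uniformly random shard.
        counts = [0] * num_shards
        for _ in range(k):
            counts[rng.randrange(num_shards)] += 1
        # A shard holding more than per_shard top hits loses the excess.
        missed += sum(max(c - per_shard, 0) for c in counts)
    return missed / (k * trials)


results = {
    n: simulate_missed_fraction(k=100, num_shards=5, per_shard=n, trials=2000)
    for n in (24, 32, 44)
}
for n, fraction in results.items():
    print(n, fraction)
```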
![Page 50: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/50.jpg)
Simulation Graph
![Page 51: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/51.jpg)
Results
1. ~ 50% savings over 100 hits (44 hits requested from each shard)
2. 77% savings over 1000 hits (228 hits requested from each shard)
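The savings figures can be checked with simple arithmetic, assuming the baseline is the naive approach of requesting all k hits from every shard. The helper name `savings` is made up for this example.

```python
def savings(k: int, per_shard: int) -> float:
    """Fraction of per-shard hits saved relative to requesting k hits from each shard."""
    return 1 - per_shard / k


for k, per_shard in ((100, 44), (1000, 228)):
    print(f"k={k}: {per_shard} hits per shard saves {savings(k, per_shard):.0%}")
```

Under that assumption, 44 of 100 works out to a 56% saving and 228 of 1000 to about 77%.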
![Page 52: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/52.jpg)
Future work
1. In-memory index
2. Move towards real-time search
![Page 53: Sudarshan Gaikaiwari - Lucene @ Yelp](https://reader033.vdocuments.us/reader033/viewer/2022051013/5492268dac79591b288b46dc/html5/thumbnails/53.jpg)
Come Join Us!