Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
TRANSCRIPT
![Page 1: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/1.jpg)
OCTOBER 11-14, 2016 • BOSTON, MA
![Page 2: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/2.jpg)
Near Real Time Indexing: Building Real Time Search Index For E-Commerce
Umesh Prasad
Tech Lead @ Flipkart
Thejus V M
Data Architect @ Flipkart
![Page 3: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/3.jpg)
Agenda
• Search @ Flipkart
• Need for Real Time Search
• SolrCloud Solution
• Our approach
• Q & A
![Page 4: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/4.jpg)
![Page 5: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/5.jpg)
![Page 6: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/6.jpg)
Traffic @ Flipkart
• Peak Traffic
  – ~800K active users
  – ~160K requests per second
• Search Traffic
  – ~40K searches per second (Service)
  – ~10K searches per second (Solr)
• Latency
  – Median : 11 ms
  – 99th percentile : 1.1 second
![Page 7: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/7.jpg)
Search @ Flipkart
• Catalogue
  – ~50 main categories
  – ~5000 sub-categories
  – ~231 million documents
  – ~90 million SKUs
  – ~160 million listings
• E-commerce Marketplace
  – ~100K Sellers
  – Local Sellers
  – Regional Availability
  – Logistics Constraints
![Page 8: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/8.jpg)
E-commerce Search
• Heavy usage of drill down filters
• Heavy usage of faceting
• Only top results matter
• Results grouped/collapsed by products
• Serviceability and delivery experience MATTERS
![Page 9: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/9.jpg)
Agenda
• Search @ Flipkart
• Need for Real Time Search
• SolrCloud Solution
• Our approach
• Q & A
![Page 10: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/10.jpg)
Sorry, Stock Over !!?
![Page 11: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/11.jpg)
Damn !! Is Offer Over ??
![Page 12: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/12.jpg)
What !! All Steal Deals Gone ??
![Page 13: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/13.jpg)
Product / Listing: Important Attributes
Seller Rating Service
Catalogue Service
Promise Service
Availability Service
Offer Service
Pricing Service
Product aka SKU
Listings
![Page 14: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/14.jpg)
Summary : Lucene Document
• Product/SKU (Parent Document)
  – Listing (Child Document)
• Query : Mostly SKU Attributes (Free Text)
• Filters : SKU + Listing Attributes (Drill Down)
• Ranking : SKU + Listing Attributes (Explicit/Relevance)
• Index Time Join aka Block Join (Best Performance)
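The parent/child layout behind an index-time (block) join can be sketched in a few lines. This is only an illustration of the idea, not Lucene code: the `Block` class, `flatten_blocks`, `parent_of`, and the sample SKUs and listings are all hypothetical. It relies on the one property a block join needs, namely that the listings of a product are stored contiguously and the product document closes the block.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Block:
    """One product (parent) together with its listings (children)."""
    product: dict
    listings: list = field(default_factory=list)


def flatten_blocks(blocks):
    """Lay documents out block by block: children first, parent last."""
    docs = []
    parent_doc_ids = []
    for block in blocks:
        for listing in block.listings:
            docs.append({"type": "listing", **listing})
        docs.append({"type": "product", **block.product})
        parent_doc_ids.append(len(docs) - 1)  # doc ID of the parent just added
    return docs, parent_doc_ids


def parent_of(doc_id, parent_doc_ids):
    """The parent of a doc is the first parent doc ID at or after it."""
    i = bisect_left(parent_doc_ids, doc_id)
    if i == len(parent_doc_ids):
        raise ValueError(f"doc {doc_id} is not inside any block")
    return parent_doc_ids[i]


# Hypothetical sample data: two products with two and one listings.
blocks = [
    Block({"sku": "P1", "brand": "Apple"},
          [{"listing": "L1", "price": 45000}, {"listing": "L2", "price": 44000}]),
    Block({"sku": "P2", "brand": "Samsung"},
          [{"listing": "L3", "price": 23000}]),
]
docs, parent_doc_ids = flatten_blocks(blocks)
```

Because the join is encoded in document order, changing one listing means rewriting its whole block, which is the cost the later slides come back to.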
![Page 15: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/15.jpg)
Out Of Stock, but Why Show? Index has Stale Availability Data
234K Products
![Page 16: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/16.jpg)
Challenge 1 : High Update Rates

| Signal | Updates / sec (normal) | Updates / sec (peak) | Updates / hr |
|---|---|---|---|
| text / catalogue | ~10 | ~100 | ~100K |
| pricing | ~100 | ~1K | ~10 million |
| availability | ~100 | ~10K | ~10 million |
| offer | ~100 | ~10K | ~10 million |
| seller rating | ~10 | ~1K | ~1 million |
| signal 6 | ~10 | ~100 | ~1 million |
| signal 7 | ~100 | ~10K | ~10 million |
| signal 8 | ~100 | ~10K | ~10 million |
![Page 17: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/17.jpg)
Challenge 2 : Micro Services
Ingestion pipeline
Catalogue Pricing Availability Offers ...
Document Builder
Solr/Lucene
Change Propagation
Documents {L1,L2 … P1}
Updates Stream 1
Updates Stream 2
Updates Stream 3
● Lucene doesn’t support Partial Updates
● Update = Delete + Add
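A minimal sketch of what "Update = Delete + Add" implies for a pipeline fed by several partial update streams. Everything here is a toy stand-in and an assumption: `DocumentBuilder` keeps the last known full document outside the index, and `ToyIndex` mimics an append-only index where a deleted document is only marked, not removed. Neither class reflects the actual Flipkart or Lucene implementation.

```python
class DocumentBuilder:
    """Merges partial updates into the last known full document."""

    def __init__(self):
        self._state = {}  # doc key -> full field map

    def apply(self, key, partial):
        full = {**self._state.get(key, {}), **partial}
        self._state[key] = full
        return dict(full)  # copy, so the index never shares builder state


class ToyIndex:
    """Append-only store: an update marks the old copy deleted and adds a new one."""

    def __init__(self):
        self.slots = []  # internal doc IDs; None marks a deleted document
        self.live = {}   # doc key -> internal doc ID of the live copy

    def update(self, key, full_doc):
        old = self.live.pop(key, None)
        if old is not None:
            self.slots[old] = None        # delete
        self.slots.append(full_doc)       # add
        self.live[key] = len(self.slots) - 1


builder = DocumentBuilder()
index = ToyIndex()

# Two streams touch the same product; each change re-adds the full document.
index.update("P1", builder.apply("P1", {"brand": "Apple"}))      # catalogue stream
index.update("P1", builder.apply("P1", {"price": 45000}))        # pricing stream
```

Every single-field change leaves a deleted copy behind, so high update rates translate directly into document churn.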
![Page 18: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/18.jpg)
Agenda
• Search @ Flipkart
• Need for Real Time Search
• SolrCloud Solution
• Our approach
• Q & A
![Page 19: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/19.jpg)
SolrCloud for NRT
Ingestion pipeline → Shard Leader : Batch of documents
Shard Leader : Update Log (For Document Versioning), Forward to Replica, Auto commit, Soft Commit
Shard Replica (× 6) : Re-open searcher
![Page 20: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/20.jpg)
SolrCloud Evaluation
• Update = Delete + Add
  – Block Join Index ⇒ Update Whole Block (Product + Listings)
• Updated Document gets streamed to all replicas in sync
  – Reduces indexing throughput
• Soft commit is Not Free
  – Soft commit ⇒ In Memory Segment
  – Lots of Merges
  – Huge document churn / deletes
  – All caches still need to be re-generated
  – Filter Cache miss especially hurts performance
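The cache cost of a soft commit can be illustrated with a toy model. This is a sketch under the assumption that cached filter results belong to one searcher and start empty when a new searcher is opened (cache warming is ignored); `Searcher`, `soft_commit`, and the sample documents are hypothetical names, not Solr classes.

```python
class Searcher:
    """A point-in-time view of the index with its own filter cache."""

    def __init__(self, docs):
        self.docs = list(docs)
        self.filter_cache = {}
        self.cache_misses = 0

    def filter(self, field_name, value):
        key = (field_name, value)
        if key not in self.filter_cache:
            # Cache miss: scan every document to rebuild the matching set.
            self.cache_misses += 1
            self.filter_cache[key] = frozenset(
                doc_id for doc_id, doc in enumerate(self.docs)
                if doc.get(field_name) == value
            )
        return self.filter_cache[key]


def soft_commit(updated_docs):
    """Opening a new searcher makes updates visible but starts with cold caches."""
    return Searcher(updated_docs)


docs = [{"brand": "Apple"}, {"brand": "Samsung"}]
old = Searcher(docs)
old.filter("brand", "Apple")
old.filter("brand", "Apple")          # second call is served from the cache

new = soft_commit(docs + [{"brand": "Apple"}])
apple_docs = new.filter("brand", "Apple")   # recomputed on the new searcher
```

With frequent soft commits, the expensive first call is paid again after every reopen, which is why filter cache misses dominate.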
![Page 21: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/21.jpg)
Agenda
• Search @ Flipkart
• Need for Real Time Index
• SolrCloud Solution
• Our approach
• Q & A
![Page 22: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/22.jpg)
ProductA
brand : Apple
availability : T
price : 45000
ProductB
brand : Samsung
availability : T
price : 23000
ProductC
brand : Apple
availability : F
price : 5000
Lucene Index
Lucene Segment

Document ID Mappings
0 ProductA
1 ProductB
2 ProductC

Posting List (Inverted Index) : Terms → Sparse Bitsets
brand : Apple → 0, 2
brand : Samsung → 1
availability : T → 0, 1

DocValues (columnar data)
Price : 45000 23000 5000
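The three structures on this slide can be rebuilt from the three sample products with a short sketch. It uses plain Python containers rather than Lucene's on-disk formats, and `build_segment` is a hypothetical helper; doc IDs are simply assigned in insertion order.

```python
from collections import defaultdict

products = [
    ("ProductA", {"brand": "Apple", "availability": "T", "price": 45000}),
    ("ProductB", {"brand": "Samsung", "availability": "T", "price": 23000}),
    ("ProductC", {"brand": "Apple", "availability": "F", "price": 5000}),
]


def build_segment(products, indexed_fields, doc_value_fields):
    """Return (doc ID mapping, posting lists, doc values) for one segment."""
    doc_ids = {}                                   # doc ID -> product ID
    postings = defaultdict(list)                   # (field, term) -> sorted doc IDs
    doc_values = {name: [] for name in doc_value_fields}  # field -> column by doc ID
    for doc_id, (product_id, fields) in enumerate(products):
        doc_ids[doc_id] = product_id
        for name in indexed_fields:
            postings[(name, fields[name])].append(doc_id)
        for name in doc_value_fields:
            doc_values[name].append(fields[name])
    return doc_ids, dict(postings), doc_values


doc_ids, postings, doc_values = build_segment(
    products, indexed_fields=["brand", "availability"], doc_value_fields=["price"]
)
```

The posting lists answer "which documents match this term", while the doc values column answers "what is the value for this document", which is the direction sorting and faceting need.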
![Page 23: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/23.jpg)
A Typical Search Flow
Query Rewrite
Results
Query
Matching
Ranking Faceting
Stats
Posting List
Doc Values
Other Components
Lucene Segment
Inverted Index
Forward Index
NRT Store
Query : samsung mobiles, Offer : exchange offer, price desc
Query Rewrite : category : mobiles, brand : samsung, Offer : exchange offer
![Page 24: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/24.jpg)
NRT Forward Index - Considerations
● Lookup efficiency
  – 50th percentile : ~10K matches
  – 99th percentile : ~1 million matches
● Data on Java heap
  – Memory efficiency
![Page 25: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/25.jpg)
NRT Forward Index - Naive Implementation

Lucene Segment
0 ProductB
1 ProductA
2 ProductC
3 ProductD

NRT Forward Index

| ProductId | Availability | Price |
|---|---|---|
| ProductA | True | 100 |
| ProductB | False | 150 |
| ProductC | False | 200 |
| ProductD | True | 250 |

Lookup Engine
DocId : 3, field : price → ProductId(3) = ProductD → <ProductD, price> → 250

Latency : ~10 secs for ~1 Million lookups
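The naive lookup path on this slide amounts to two steps per matched document: resolve the doc ID to a product ID, then look the product up in a keyed store. A sketch using the slide's sample values follows; the container and function names are hypothetical, and a Python dict stands in for whatever keyed store was actually used.

```python
# Per-segment mapping: Lucene doc ID -> product ID (from the slide).
segment_product_ids = ["ProductB", "ProductA", "ProductC", "ProductD"]

# NRT forward index keyed by product ID (values from the slide).
nrt_by_product = {
    "ProductA": {"availability": True, "price": 100},
    "ProductB": {"availability": False, "price": 150},
    "ProductC": {"availability": False, "price": 200},
    "ProductD": {"availability": True, "price": 250},
}


def naive_lookup(doc_id, field_name):
    """Two hops per document: fetch the string key, then a hash lookup."""
    product_id = segment_product_ids[doc_id]      # hop 1: doc ID -> product ID
    return nrt_by_product[product_id][field_name]  # hop 2: product ID -> value


price_of_doc_3 = naive_lookup(3, "price")
```

Repeating a string fetch plus a hash lookup for each of roughly a million matches is presumably what pushes this approach to the latency quoted above.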
![Page 26: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/26.jpg)
NRT Store - Forward Index Optimized

Lucene Segment
0 ProductB
1 ProductA
2 ProductC
3 ProductD

DocId - NrtId

| DocId | NrtId |
|---|---|
| 0 | 3 |
| 1 | 0 |
| 2 | 1 |
| 3 | 2 |

NRT Forward Index (Segment Independent)

| NrtId | ProductId | Price | Availability | Status |
|---|---|---|---|---|
| 0 | ProductA | 100 | T | 01 |
| 1 | ProductC | 200 | F | 10 |
| 2 | ProductD | 250 | F | 01 |
| 3 | ProductB | 150 | T | 00 |

Lookup Engine
DocId : 3, Field : price → NrtId(3) = 2 → Price(2) = 250

Latency : ~100 ms for ~1 Million lookups
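The optimized lookup replaces the string key with an integer indirection. Below is a sketch using the slide's values, with Python's `array` module standing in for primitive Java arrays; the variable and function names are hypothetical, and the Status bits are left out because the slides do not explain them.

```python
from array import array

# Per-segment mapping: Lucene doc ID -> NRT ID (from the slide).
doc_to_nrt = array("i", [3, 0, 1, 2])

# Segment-independent columns, indexed by NRT ID (from the slide).
price = array("i", [100, 200, 250, 150])
availability = [True, False, False, True]


def lookup_price(doc_id):
    """Two array reads: doc ID -> NRT ID -> value."""
    return price[doc_to_nrt[doc_id]]


def update_price(nrt_id, new_price):
    """An NRT update overwrites one slot in place; the Lucene segment is untouched."""
    price[nrt_id] = new_price


before = lookup_price(3)   # NrtId(3) = 2, Price(2) = 250
update_price(2, 240)       # hypothetical price change for ProductD
after = lookup_price(3)
```

Because the columns are keyed by NRT ID rather than doc ID, only the small doc-to-NRT mapping has to be rebuilt when segments change, and updates become visible without a commit.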
![Page 27: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/27.jpg)
NRT Store Filter - PostFilter

PostFilter(Price:[100 TO 150])

Lucene Segment
0 ProductB
1 ProductA
2 ProductC
3 ProductD

DocId - NrtId

| DocId | NrtId |
|---|---|
| 0 | 3 |
| 1 | 0 |
| 2 | 1 |
| 3 | 2 |

NRT Forward Index (Segment Independent)

| NrtId | ProductId | Price | Availability | Status |
|---|---|---|---|---|
| 0 | ProductA | 100 | T | 01 |
| 1 | ProductC | 200 | F | 10 |
| 2 | ProductD | 250 | F | 01 |
| 3 | ProductB | 150 | T | 00 |

DocId : 3 → NrtId(3) = 2 → Price(2) = 250 → Don’t Delegate
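The post-filter step can be sketched as a collector that wraps another collector and forwards a document only when its current NRT value passes the check. This mirrors the delegating-collector pattern in spirit only: `ListCollector` and `PricePostFilter` are hypothetical Python classes, not Solr's API, and the data comes from the slide.

```python
class ListCollector:
    """Final collector: records every doc ID it is handed."""

    def __init__(self):
        self.doc_ids = []

    def collect(self, doc_id):
        self.doc_ids.append(doc_id)


class PricePostFilter:
    """Forwards a doc to the delegate only if its NRT price is within [low, high]."""

    def __init__(self, delegate, doc_to_nrt, price, low, high):
        self.delegate = delegate
        self.doc_to_nrt = doc_to_nrt
        self.price = price
        self.low = low
        self.high = high

    def collect(self, doc_id):
        value = self.price[self.doc_to_nrt[doc_id]]
        if self.low <= value <= self.high:
            self.delegate.collect(doc_id)
        # Otherwise: don't delegate, so the doc drops out of the results.


doc_to_nrt = [3, 0, 1, 2]          # doc ID -> NRT ID
price = [100, 200, 250, 150]       # indexed by NRT ID

results = ListCollector()
post_filter = PricePostFilter(results, doc_to_nrt, price, low=100, high=150)
for doc_id in range(4):            # pretend every doc matched the main query
    post_filter.collect(doc_id)
```

The check runs once per matched document at query time, so it always sees the latest value, at the price of extra work on large result sets.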
![Page 28: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/28.jpg)
NRT Store - Invert index
Components : NRT Forward Store, NRT Inverter, NRT DocIdSet Cache, NRT Filter

Lucene Segment
0 ProductB
1 ProductA
2 ProductC
3 ProductD

NRT DocIdSet Cache
Availability : T → 0, 3
Offer : O1 → 2, 3

NRT Filter : Offer:O1 → DocIdSet
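Inverting the forward store into cached doc ID sets might look like the sketch below. The offer column is hypothetical (the slides do not list per-product offers) and was chosen so that `Offer : O1` resolves to doc IDs 2 and 3 as on the slide; `NrtDocIdSetCache` and its `refresh` method are likewise assumptions about how such a cache could behave.

```python
class NrtDocIdSetCache:
    """Caches, per (field, value), the set of segment doc IDs that currently match."""

    def __init__(self, doc_to_nrt, columns):
        self.doc_to_nrt = doc_to_nrt   # doc ID -> NRT ID for one segment
        self.columns = columns         # field -> values indexed by NRT ID
        self._cache = {}

    def _invert(self, field_name, value):
        column = self.columns[field_name]
        return frozenset(
            doc_id for doc_id, nrt_id in enumerate(self.doc_to_nrt)
            if column[nrt_id] == value
        )

    def get(self, field_name, value):
        key = (field_name, value)
        if key not in self._cache:
            self._cache[key] = self._invert(field_name, value)
        return self._cache[key]

    def refresh(self):
        """Re-invert every cached entry against the current forward store."""
        for field_name, value in list(self._cache):
            self._cache[(field_name, value)] = self._invert(field_name, value)


doc_to_nrt = [3, 0, 1, 2]
columns = {"offer": ["O2", "O1", "O1", "O2"]}   # hypothetical, indexed by NRT ID

cache = NrtDocIdSetCache(doc_to_nrt, columns)
first = cache.get("offer", "O1")      # doc IDs 2 and 3

columns["offer"][2] = "O2"            # NRT update: ProductD loses offer O1
stale = cache.get("offer", "O1")      # cached set has not caught up yet
cache.refresh()
fresh = cache.get("offer", "O1")
```

Between an update and the next refresh, a value lookup and the cached filter can disagree, which is the trade-off that makes this path near real time rather than real time.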
![Page 29: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/29.jpg)
Solr Integration Points
• ValueSources
• Filtering
  – Custom Filter Implementation for cached DocIdSet
  – Custom PostFilter
• Query
  – Wrapper over Filter
• Custom FacetComponent
![Page 30: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/30.jpg)
Near Real Time Solr Architecture
Solr
Kafka
Ingestion pipeline
NRT Forward Index
Ranking
Matching
Faceting
Redis
Bootstrap
NRT Inverted store
Solr Master
NRT Updates
Lucene Updates
Catalogue
Pricing
Availability
Offers
Seller Quality
Commit + Replicate + Reopen
Lucene Others
![Page 31: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/31.jpg)
Accomplishments
• Real time sorting
• Real time filtering : PostFilter
  – Higher latency
• Near real time filtering : cached DocIdSet
  – No consistency between lookup and filtering
• Independent of Lucene commits
• Query latency comparable to DocValues
  – Consistent 99% performance
![Page 32: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/32.jpg)
Accomplishments @ Flipkart
● Real time consumption for ~150 Signals
● Reduction in shown out of stock products by 2X
● Production instances of ~50K updates/second real time
![Page 33: Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart](https://reader031.vdocuments.us/reader031/viewer/2022022414/587067c71a28ab48378b550b/html5/thumbnails/33.jpg)
Thank you & Questions