Slash n near real time indexing
A real time search index for e-commerce
Umesh Prasad, Thejus V M
Oh!! Out Of Stock
Damn !! Out of Stock
Damn !! Missed the Offer
E-commerce Index Attributes
catalogue service
Promise Engine
Availability Service
Seller Rating
LISTING
PRODUCT aka SKU
Offer Engine
Pricing Engine
Out Of Stock, but Why Show?
Index has Stale Availability Data
234K Products
Outline
❏ E-commerce search Challenge
❏ Challenges in Keeping an Inverted Index Updated
❏ Our approach to Near Real Time indexing
Challenge 1 : Update rates
signal             updates / sec (min)   updates / sec (max)   max updates / hr
text / catalogue   ~10                   ~100                  ~100K
pricing            ~100                  ~1K                   ~10 million
availability       ~100                  ~10K                  ~10 million
offer              ~100                  ~10K                  ~10 million
seller rating      ~10                   ~1K                   ~1 million
signal 6           ~10                   ~100                  ~1 million
signal 7           ~100                  ~10K                  ~10 million
signal 8           ~100                  ~10K                  ~10 million
Challenge 2 : Lucene Index Update
● Lucene doesn’t support Partial Updates.
● Update = Delete Old Doc + Add New Document
– Recreate the entire document for every update
– Not friendly with multiple micro-services with different update rates
● Problem Compounded By MarketPlace
● Product + All Its Listings == SINGLE BLOCK
● BLOCK structure chosen for query performance (~100X better latencies)
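A minimal sketch of why delete + add is expensive for block-structured documents. The BlockIndex class, its method names, and the listing IDs and prices are hypothetical stand-ins for illustration; this is not Lucene's or Solr's API. It only counts how many documents get rewritten when a single listing price changes.

```python
from typing import Dict


class BlockIndex:
    """Hypothetical in-memory stand-in for an index without partial updates."""

    def __init__(self) -> None:
        self.blocks: Dict[str, dict] = {}  # product id -> product + all its listings
        self.docs_written = 0              # documents (re)written so far

    def add_block(self, product_id: str, block: dict) -> None:
        # A block is one product document plus one document per listing.
        self.blocks[product_id] = block
        self.docs_written += 1 + len(block["listings"])

    def delete_block(self, product_id: str) -> None:
        self.blocks.pop(product_id, None)

    def update_listing_price(self, product_id: str, listing_id: str, price: int) -> None:
        # No partial update: rebuild the whole block, delete the old one, add the new one.
        old = self.blocks[product_id]
        new_block = {
            "product": dict(old["product"]),
            "listings": {lid: dict(fields) for lid, fields in old["listings"].items()},
        }
        new_block["listings"][listing_id]["price"] = price
        self.delete_block(product_id)
        self.add_block(product_id, new_block)


index = BlockIndex()
index.add_block(
    "ProductA",
    {
        "product": {"brand": "Apple"},
        "listings": {"L1": {"price": 45000}, "L2": {"price": 44500}, "L3": {"price": 46000}},
    },
)
index.update_listing_price("ProductA", "L2", 44000)
print(index.docs_written)  # initial add wrote 4 docs, the one-field update wrote 4 more
```

In this toy model one price change on one listing rewrites the product and every listing in its block, which is the cost the slide attributes to the block structure.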
Challenge 3 : Refresh Cycle
Ingestion pipeline → Batch of documents → Solr Master (Commit, fsync)
Solr Master → Replication → 6 × Solr Slave
Each Solr Slave → Open new Index
ProductA
brand : Apple
availability : T
price : 45000
ProductB
brand : Samsung
availability : F
price : 23000
ProductC
brand : Apple
availability : T
price : 5000
Lucene Index → Lucene Segment
Document ID Mappings
0 ProductA
1 ProductB
2 ProductC
Posting List (Inverted Index)
Terms              Sparse Bitsets
availability : T   0 , 2
brand : Samsung    1
brand : Apple      0 , 2
DocValues (columnar data)
Price   45000 23000 5000
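The three structures in the segment can be sketched from the batch of documents above. This is an illustrative toy, assuming plain Python lists and dicts in place of Lucene's on-disk formats; build_segment is a hypothetical helper name.

```python
from collections import defaultdict

# The batch of documents from the slide.
docs = [
    ("ProductA", {"brand": "Apple", "availability": "T", "price": 45000}),
    ("ProductB", {"brand": "Samsung", "availability": "F", "price": 23000}),
    ("ProductC", {"brand": "Apple", "availability": "T", "price": 5000}),
]


def build_segment(docs):
    """Build document ID mappings, posting lists and a price column for one segment."""
    id_mapping = {}               # internal doc ID -> external product ID
    postings = defaultdict(list)  # "field:value" term -> sorted internal doc IDs
    doc_values = {"price": []}    # columnar data, indexed by internal doc ID
    for doc_id, (external_id, fields) in enumerate(docs):
        id_mapping[doc_id] = external_id
        for field in ("brand", "availability"):
            postings[f"{field}:{fields[field]}"].append(doc_id)
        doc_values["price"].append(fields["price"])
    return id_mapping, dict(postings), doc_values


id_mapping, postings, doc_values = build_segment(docs)
print(postings["brand:Apple"])  # doc IDs of ProductA and ProductC
print(doc_values["price"])      # one price per doc ID
```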
Root Cause : Updating Data Structures
POSTING LIST
Term 1 → BitSet 1
Term 2 → BitSet 2
Term 3 → BitSet 3
... Millions of Terms
Document
Term1 Term2 Term3 Term4 ... Thousands of Terms
Posting List / Bit Set (Millions of Documents)           Updatable ?
D (dense) : 0 1 0 0 0 0 1 0 0 0 0 0 0 1                   Yes
S (sparse) : 2,7,14                                       May Be
SE (sparse encoded) : 2,5,7                               NO
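The D / S / SE rows can be reproduced with a short sketch. It assumes SE means gap (delta) encoding of the sorted doc IDs, which matches the numbers on the slide (2, 7-2, 14-7), and it uses 1-based positions as the slide does. The function names are illustrative only.

```python
import bisect


def dense_to_sparse(bits):
    """Dense bitset -> sorted list of set positions (1-based, as on the slide)."""
    return [pos for pos, bit in enumerate(bits, start=1) if bit]


def delta_encode(sparse):
    """Sorted doc IDs -> gaps between consecutive IDs."""
    gaps, prev = [], 0
    for doc_id in sparse:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def delta_decode(gaps):
    """Gaps -> sorted doc IDs (running sum)."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


def insert_doc(gaps, doc_id):
    """Adding one doc to a gap-encoded list means decoding and re-encoding it."""
    ids = delta_decode(gaps)
    bisect.insort(ids, doc_id)
    return delta_encode(ids)


dense = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
sparse = dense_to_sparse(dense)
gaps = delta_encode(sparse)
print(sparse, gaps)

dense[4 - 1] = 1                 # dense: flip one bit in place
new_gaps = insert_doc(gaps, 4)   # encoded: every later gap has to be revisited
print(new_gaps)
```

The dense form is updated by flipping a single bit, while the encoded form has to be decoded and rewritten, which is one way to read the Yes / May Be / NO column.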
A Typical Search Flow
Query → Query Rewrite → Matching → Ranking, Faceting, Stats → Results
Lucene Segment: Posting List (Inverted Index), Doc Values (Forward Index), Other Components
NRT Store
NRT Forward Index - Considerations
● Lookup efficiency
– 50th percentile : ~10K matches
– 99th percentile : ~1 million matches
● Data on Java heap
– Memory efficiency
● Hook it to Lucene
NRT Store - Forward Index Naive
Lucene Segment (DocId → ProductId)
0 ProductB
1 ProductA
2 ProductC
3 ProductD
NRT Forward Index
ProductId   Availability   Price
ProductA    True           100
ProductB    False          150
ProductC    False          200
ProductD    True           250
Lookup Engine
DocId : 3, field : price → ProductId(3) → <ProductC, price> → 200
Latency : ~10 secs for ~1 Million lookups
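A rough sketch of the naive lookup path, assuming the NRT forward index is a hash map keyed by product ID and that the segment can resolve a doc ID to its product ID. The names and the 0-based doc IDs are assumptions for illustration; the data is taken from the tables above.

```python
# Segment-side mapping: internal doc ID (list index) -> external product ID.
segment_doc_to_product = ["ProductB", "ProductA", "ProductC", "ProductD"]

# NRT forward index keyed by product ID (string keys, one dict per product).
nrt_forward_index = {
    "ProductA": {"availability": True, "price": 100},
    "ProductB": {"availability": False, "price": 150},
    "ProductC": {"availability": False, "price": 200},
    "ProductD": {"availability": True, "price": 250},
}


def naive_lookup(doc_id, field):
    """Two hops per matched document: doc ID -> product ID, then a hash lookup."""
    product_id = segment_doc_to_product[doc_id]
    return nrt_forward_index[product_id][field]


print(naive_lookup(2, "price"))  # doc 2 is ProductC in this segment
```

Every matched document pays for resolving its product ID and hashing a string key, which adds up quickly at ~1 million matches per query.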
NRT Store - Forward Index Optimized
Lucene Segment (DocId → ProductId)
0 ProductB
1 ProductA
2 ProductC
3 ProductD
NRT Forward Index (Segment Independent) (NrtId → ProductId)
0 ProductA
1 ProductC
2 ProductD
3 ProductB
Availability : T F F T
Price : 100 200 250 150
DocId - NrtId
0 → 3
1 → 0
2 → 1
3 → 2
Lookup Engine
DocId : 3, field : price → NrtId(3) → 2 → Price(2) → 200
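A minimal sketch of the optimized layout, assuming 0-based IDs, a per-segment DocId → NrtId integer array, and one flat column per field indexed by NrtId. The function names and the updated price of 210 are hypothetical.

```python
from array import array

# Segment-independent NRT IDs, assigned once per product.
nrt_ids = {"ProductA": 0, "ProductC": 1, "ProductD": 2, "ProductB": 3}

# One flat column per field, indexed by NRT ID.
price_by_nrt_id = array("i", [100, 200, 250, 150])

# Per-segment mapping built once when the segment is opened: doc ID -> NRT ID.
segment_docs = ["ProductB", "ProductA", "ProductC", "ProductD"]
doc_to_nrt = array("i", [nrt_ids[product] for product in segment_docs])


def lookup_price(doc_id):
    """Two array reads per matched document, with no hashing or string handling."""
    return price_by_nrt_id[doc_to_nrt[doc_id]]


def update_price(product_id, price):
    """An update touches only the column; the segment mapping stays valid."""
    price_by_nrt_id[nrt_ids[product_id]] = price


print(list(doc_to_nrt))   # matches the DocId - NrtId table
print(lookup_price(2))    # doc 2 is ProductC
```

Because the columns are keyed by NrtId rather than by Lucene doc ID, a price update does not depend on any segment being rewritten or reopened.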
NRT Store - Inverted Index
NRT Forward Store → NRT Inverter → NRT Invert Store
Lucene Segment (DocId → ProductId)
0 ProductB
1 ProductA
2 ProductC
3 ProductD
NRT Invert Store (term → DocIds)
Availability : T   0 3
Offer : O1         2 3
Availability:T → Matching BitSet
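One way to picture the inverter is as a pass over a segment's documents that turns forward values into per-term bitsets. This is a sketch under assumptions: the forward-store values below are chosen so that the output reproduces the doc IDs on the slide, and a Python int is used as a stand-in for a bitset.

```python
# Segment-side mapping: internal doc ID (list index) -> product ID.
segment_docs = ["ProductB", "ProductA", "ProductC", "ProductD"]

# Hypothetical forward-store contents, keyed by product ID.
forward_store = {
    "ProductA": {"availability": "F", "offer": None},
    "ProductB": {"availability": "T", "offer": None},
    "ProductC": {"availability": "F", "offer": "O1"},
    "ProductD": {"availability": "T", "offer": "O1"},
}


def invert(segment_docs, forward_store):
    """Build term -> bitset of doc IDs for one segment (bit i set means doc i matches)."""
    inverted = {}
    for doc_id, product_id in enumerate(segment_docs):
        for field, value in forward_store[product_id].items():
            if value is None:
                continue
            term = f"{field}:{value}"
            inverted[term] = inverted.get(term, 0) | (1 << doc_id)
    return inverted


def doc_ids(bitset):
    """Expand an int bitset into a sorted list of doc IDs."""
    return [i for i in range(bitset.bit_length()) if (bitset >> i) & 1]


inverted = invert(segment_docs, forward_store)
print(doc_ids(inverted["availability:T"]))
print(doc_ids(inverted["offer:O1"]))
```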
Near Real Time Solr Architecture
Update sources: Catalogue, Pricing, Availability, Offers, Seller Quality, Others → Ingestion pipeline
Text Updates: Ingestion pipeline → Solr Master → Commit + Replicate + Reopen → Solr (Lucene)
NRT Updates: Ingestion pipeline → Kafka → Solr (NRT Forward Index, NRT Inverted store)
Redis: Bootstrap
Solr: Matching, Ranking, Faceting
Accomplishments
● Real time consumption for Ranking Signals
● BBD saw up to ~30K updates/second
● Query latency comparable to DocValues
– Consistent 99% performance
Thank you & Questions
A Typical Search Flow
Query → Query Rewrite → Matching → Ranking, Faceting, Stats → Results
Lucene Index: Schema, Posting List (Inverted Index), Doc Values (Forward Index), Other Components
NRT Store: Schema
Lucene Index
docId   External ID   Brand    Availability   Price
0       ProductA      Adidas   True           250
1       ProductB      Adidas   False          230
2       ProductC      Nike     True           500
Posting List (inverted index)
term ords   Terms Dictionary     docIds
0           availability:true    0,2
1           availability:false   1
0           brand:adidas         0,1
1           brand:nike           2
1           price:230            1
2           price:250            0
Doc Value (Forward index): term ords per docId
field          0   1   2
price          2   1   3
brand          0   0   1
availability   0   1   0
● Lucene Index = Multiple Mini Indexes aka Segments
● Lucene Segment
○ Write Once → Immutable Data structures
○ Posting List (Sparse encoded bitsets)
○ Doc Values (Columnar Data structures)
C5 : Lucene in-place update
● Only numeric / byte Array fields
● Updates still go through the entire refresh cycle
● Not exposed via Solr
Forward Index - API Hook
● Lucene API Hook
– ValueSource
● Input
– Lucene Internal Document Id
– Field Name
● Output
– Field Value
NRT Store - Inverted Index
● Input
– Lucene Segment
– query
• Field Name : Field Value
• offer : o1
● Output
– DocSet (posting list)