slash n: Near Real Time Indexing

Post on 16-Apr-2017


A real time search index for e-commerce

Umesh Prasad, Thejus V M

Oh!! Out Of Stock

Damn !! Out of Stock

Damn !! Missed the Offer

E-commerce Index Attributes

catalogue service

Promise Engine

Availability Service

Seller Rating

LISTING

PRODUCT aka SKU

OfferEngine

PricingEngine

Out Of Stock, but Why Show?

Index has Stale Availability Data

234K Products

Outline

❏ E-commerce search Challenge

❏ Challenges in Keeping an Inverted Index Updated

❏ Our approach to Near Real Time indexing

Challenge 1 : Update rates

Signal              updates/sec (min)   updates/sec (max)   max updates/hr
text / catalogue    ~10                 ~100                ~100K
pricing             ~100                ~1K                 ~10 million
availability        ~100                ~10K                ~10 million
offer               ~100                ~10K                ~10 million
seller rating       ~10                 ~1K                 ~1 million
signal 6            ~10                 ~100                ~1 million
signal 7            ~100                ~10K                ~10 million
signal 8            ~100                ~10K                ~10 million

Challenge 2 : Lucene Index Update

● Lucene doesn’t support partial updates.

● Update = Delete Old Doc + Add New Document

– Recreate the entire document for every update

– Not friendly with multiple micro-services with different update rates

● Problem compounded by the marketplace

– Product + all its listings == SINGLE BLOCK

– BLOCK structure chosen for query performance (~100X better latencies)
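The cost of "update = delete + add" can be sketched with a toy in-memory index. This is not Lucene's API: `ToyIndex`, `rebuild_document`, and the per-service dictionaries are hypothetical names chosen for illustration. The point is only that a change to one field forces the whole document to be reassembled from every upstream service.

```python
class ToyIndex:
    """Append-only document store that mimics delete-then-add updates."""

    def __init__(self):
        self.docs = []   # slot -> document dict (None once deleted)
        self.live = {}   # external id -> slot holding the current version

    def add(self, doc_id, doc):
        self.live[doc_id] = len(self.docs)
        self.docs.append(dict(doc))

    def update(self, doc_id, full_doc):
        # No partial update: tombstone the old slot, append a whole new doc.
        old_slot = self.live.get(doc_id)
        if old_slot is not None:
            self.docs[old_slot] = None
        self.add(doc_id, full_doc)


def rebuild_document(doc_id, services):
    """Merge the latest fields from every upstream service into one document."""
    doc = {}
    for fields_by_id in services.values():
        doc.update(fields_by_id.get(doc_id, {}))
    return doc


# Hypothetical upstream services, each owning some fields of the document.
services = {
    "catalogue": {"ProductA": {"brand": "Apple"}},
    "pricing": {"ProductA": {"price": 45000}},
    "availability": {"ProductA": {"availability": True}},
}

index = ToyIndex()
index.add("ProductA", rebuild_document("ProductA", services))

# Only availability changed, yet every service's data has to be read again.
services["availability"]["ProductA"]["availability"] = False
index.update("ProductA", rebuild_document("ProductA", services))
```

With a product and all of its listings stored as a single block, the same rebuild would have to cover every listing in the block, which is why the marketplace structure compounds the problem.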

Challenge 3 : Refresh Cycle

[Diagram: Ingestion pipeline → Solr Master (Commit, fsync) → Replication → 6 × Solr Slave (each: Open new Index)]

Batch of documents

ProductA    brand : Apple      availability : T    price : 45000
ProductB    brand : Samsung    availability : F    price : 23000
ProductC    brand : Apple      availability : T    price : 5000

Lucene Index → Lucene Segment

Document ID Mappings
0    ProductA
1    ProductB
2    ProductC

Posting List (Inverted Index): Terms → Sparse Bitsets
availability : T     0 , 2
brand : Apple        0 , 2
brand : Samsung      1

DocValues (columnar data)
Price    45000    23000    5000
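The three structures in the diagram can be reproduced with a short sketch. It is a simplification, not Lucene's implementation; `build_segment` and its parameters are hypothetical names. It assigns document IDs in arrival order, builds a posting list per `field:value` term, and stores the price as a column indexed by document ID.

```python
from collections import defaultdict

products = [
    ("ProductA", {"brand": "Apple", "availability": "T", "price": 45000}),
    ("ProductB", {"brand": "Samsung", "availability": "F", "price": 23000}),
    ("ProductC", {"brand": "Apple", "availability": "T", "price": 5000}),
]


def build_segment(batch, indexed_fields, doc_value_fields):
    doc_ids = {}                                    # docId -> external product id
    postings = defaultdict(list)                    # "field:value" -> ascending docIds
    doc_values = {f: [] for f in doc_value_fields}  # field -> column indexed by docId
    for doc_id, (external_id, fields) in enumerate(batch):
        doc_ids[doc_id] = external_id
        for field in indexed_fields:
            postings[f"{field}:{fields[field]}"].append(doc_id)
        for field in doc_value_fields:
            doc_values[field].append(fields[field])
    return doc_ids, dict(postings), doc_values


doc_ids, postings, doc_values = build_segment(
    products,
    indexed_fields=["brand", "availability"],
    doc_value_fields=["price"],
)
```

All three structures are keyed by the segment-local document ID, so they are only valid for the segment they were built with.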

Root Cause : Updating Data Structures

POSTING LIST (millions of terms)
Term 1 → BitSet 1
Term 2 → BitSet 2
Term 3 → BitSet 3
………………………...

Document (thousands of terms)
Term1 Term2 Term3 Term4 ………………………...

Posting List / Bit Set (millions of documents)
Representation                         Updatable ?
D  : 0 1 0 0 0 0 1 0 0 0 0 0 0 1       Yes
S  : 2,7,14                            May Be
SE : 2,5,7                             NO
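The three rows can be read as a dense bitset (D), the sorted positions of its set bits (S), and those positions stored as gaps from the previous entry (SE). That reading is an assumption, but it reproduces the numbers on the slide, as the sketch below shows. The helper names are hypothetical.

```python
def to_sparse(dense):
    """1-based positions of the set bits."""
    return [i for i, bit in enumerate(dense, start=1) if bit]


def to_deltas(sparse):
    """Gap encoding: each entry is the distance from the previous position."""
    deltas, prev = [], 0
    for pos in sparse:
        deltas.append(pos - prev)
        prev = pos
    return deltas


def from_deltas(deltas):
    """Inverse of to_deltas: running sum of the gaps."""
    positions, total = [], 0
    for gap in deltas:
        total += gap
        positions.append(total)
    return positions


dense = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
sparse = to_sparse(dense)      # [2, 7, 14]
encoded = to_deltas(sparse)    # [2, 5, 7]

# Dense: clearing a bit is a single in-place write.
dense[6] = 0

# Gap-encoded: removing the same document means decode, edit, re-encode,
# because every later gap depends on the entries before it.
positions = from_deltas(encoded)
positions.remove(7)
encoded_after = to_deltas(positions)   # [2, 12]
```

The more compact the representation, the more of it has to be rewritten for a single change, which matches the Yes / May Be / NO column.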


A Typical Search Flow

[Diagram: Query → Query Rewrite → Matching → Ranking, Faceting, Stats → Results. Data sources shown: Lucene Segment (Posting List, Doc Values, Other Components) and NRT Store (Inverted Index, Forward Index).]

NRT Forward Index - Considerations

● Lookup efficiency

– 50th percentile : ~10K matches

– 99th percentile : ~1 million matches

● Data on Java heap

– Memory efficiency

● Hook it to Lucene

NRT Store - Forward Index Naive

Lucene Segment (DocId → ProductId)
0    ProductB
1    ProductA
2    ProductC
3    ProductD

NRT Forward Index
ProductId    Availability    Price
ProductA     True            100
ProductB     False           150
ProductC     False           200
ProductD     True            250

Lookup Engine
DocId : 3, field : price → ProductId(3) → <ProductC, price> → 200

Latency : ~10 secs for ~1 Million lookups
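A minimal sketch of the naive lookup, using the values from the tables above. `segment_ids`, `nrt_index`, and `naive_lookup` are hypothetical names, and document IDs here are 0-based. Every matched document costs two hops: resolve the document ID to a product ID, then do a hash lookup on that string key.

```python
# Lucene segment: docId -> external product id.
segment_ids = ["ProductB", "ProductA", "ProductC", "ProductD"]

# NRT forward index keyed by product id.
nrt_index = {
    "ProductA": {"availability": True, "price": 100},
    "ProductB": {"availability": False, "price": 150},
    "ProductC": {"availability": False, "price": 200},
    "ProductD": {"availability": True, "price": 250},
}


def naive_lookup(doc_id, field):
    product_id = segment_ids[doc_id]       # hop 1: docId -> product id
    return nrt_index[product_id][field]    # hop 2: hash lookup by string key


price_of_doc_2 = naive_lookup(2, "price")  # ProductC
```

In a real segment the first hop is itself a per-document read of the product ID, so at ~1 million matches the string handling and hashing are repeated a million times per query.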

NRT Store - Forward Index Optimized

Lucene Segment (DocId → ProductId)
0    ProductB
1    ProductA
2    ProductC
3    ProductD

NRT Forward Index (Segment Independent)
NrtId    ProductId    Availability    Price
0        ProductA     T               100
1        ProductC     F               200
2        ProductD     F               250
3        ProductB     T               150

DocId - NrtId
0    3
1    0
2    1
3    2

Lookup Engine
DocId : 3, field : price → NrtId(3) → 2 → Price(2) → 200
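The optimized layout can be sketched with plain arrays. The names below are hypothetical and IDs are 0-based. Each product gets a dense, segment-independent NrtId; field values live in columns indexed by NrtId; and a DocId → NrtId array is built once per segment, so a lookup at query time is two array reads with no string handling.

```python
from array import array

# Segment-independent NRT store: one dense NrtId per product.
nrt_ids = {"ProductA": 0, "ProductC": 1, "ProductD": 2, "ProductB": 3}
price = array("q", [100, 200, 250, 150])   # column indexed by NrtId
availability = array("b", [1, 0, 0, 1])    # column indexed by NrtId


def build_doc_to_nrt(segment_products):
    """Built once per segment, when the segment is opened."""
    return array("i", (nrt_ids[p] for p in segment_products))


segment_products = ["ProductB", "ProductA", "ProductC", "ProductD"]
doc_to_nrt = build_doc_to_nrt(segment_products)   # 3, 0, 1, 2


def lookup_price(doc_id):
    return price[doc_to_nrt[doc_id]]


def apply_update(product_id, new_price):
    # An update touches one column slot; no segment is rewritten.
    price[nrt_ids[product_id]] = new_price


before = lookup_price(2)          # ProductC
apply_update("ProductC", 180)
after = lookup_price(2)
```

Because the columns are keyed by NrtId rather than by document ID, they survive segment merges and reopens; only the small mapping array has to be rebuilt.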

NRT Store - Inverted Index

NRT Forward Store → NRT Inverter → NRT Invert Store

Lucene Segment (DocId → ProductId)
0    ProductB
1    ProductA
2    ProductC
3    ProductD

NRT Invert Store (Term → DocIds)
Availability : T    0 3
Offer : O1          2 3

Availability:T → Matching BitSet
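One way to picture the inverter: walk a segment's documents, read each product's current values from the forward store, and emit a posting list per `(field, value)` pair in that segment's document ID space. This is a sketch under that assumption; `invert`, `matching_bitset`, and the forward-store contents are hypothetical and chosen to reproduce the postings shown above.

```python
forward_store = {
    "ProductA": {"availability": "F", "offer": None},
    "ProductB": {"availability": "T", "offer": None},
    "ProductC": {"availability": "F", "offer": "O1"},
    "ProductD": {"availability": "T", "offer": "O1"},
}
segment_products = ["ProductB", "ProductA", "ProductC", "ProductD"]


def invert(segment_products, forward_store, fields):
    """Build (field, value) -> ascending docIds for one segment."""
    inverted = {}
    for doc_id, product_id in enumerate(segment_products):
        for field in fields:
            value = forward_store[product_id].get(field)
            if value is not None:
                inverted.setdefault((field, value), []).append(doc_id)
    return inverted


def matching_bitset(inverted, field, value, num_docs):
    """Expand a posting list into a dense bitset for the matching phase."""
    bits = [0] * num_docs
    for doc_id in inverted.get((field, value), []):
        bits[doc_id] = 1
    return bits


inverted = invert(segment_products, forward_store, ["availability", "offer"])
available = matching_bitset(inverted, "availability", "T", len(segment_products))
```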

Near Real Time Solr Architecture

[Diagram]

Sources: Catalogue, Pricing, Availability, Offers, Seller Quality → Ingestion pipeline

Text Updates: Ingestion pipeline → Solr Master → Commit + Replicate + Reopen → Solr (Lucene)

NRT Updates: Ingestion pipeline → Kafka → Solr (NRT Forward Index, NRT Inverted store)

Redis: Bootstrap

Solr query components: Matching, Ranking, Faceting, Others
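The NRT update path can be sketched as a consumer that applies field-level changes straight to the forward index, bypassing the commit, replicate, and reopen cycle. Everything here is an assumption for illustration: a standard-library `queue.Queue` stands in for the Kafka topic, a plain dictionary stands in for the snapshot, and the diagram's "Redis / Bootstrap" labels are read as loading last-known values at startup.

```python
import queue

updates = queue.Queue()   # stand-in for a Kafka topic
nrt_forward = {}          # product id -> {field: value}


def bootstrap(snapshot):
    """Load last-known values at startup (snapshot source is hypothetical)."""
    for product_id, fields in snapshot.items():
        nrt_forward[product_id] = dict(fields)


def drain(q):
    """Apply every pending (product_id, field, value) update; return the count."""
    applied = 0
    while True:
        try:
            product_id, field, value = q.get_nowait()
        except queue.Empty:
            return applied
        nrt_forward.setdefault(product_id, {})[field] = value
        applied += 1


bootstrap({"ProductA": {"availability": True, "price": 45000}})
updates.put(("ProductA", "availability", False))
updates.put(("ProductB", "price", 23000))
applied = drain(updates)
```

Each message carries a single field, so services with very different update rates can publish independently without rebuilding whole documents.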

Accomplishments

● Real time consumption for Ranking Signals

● BBD saw up to ~30K updates/second

● Query latency comparable to DocValues

– Consistent 99% performance

Thank you & Questions

A Typical Search Flow

[Diagram: Query → Query Rewrite → Matching → Ranking, Faceting, Stats → Results. Data sources shown: Lucene Index (Posting List, Doc Values, Schema, Other Components) and NRT Store (Inverted Index, Forward Index, Schema).]

Lucene Index

Documents
docId    External ID    Brand     Availability    Price
0        ProductA       Adidas    True            250
1        ProductB       Adidas    False           230
2        ProductC       Nike      True            500

Terms Dictionary and Posting List (inverted index)
term ord    term                  docIds
0           availability:true     0,2
1           availability:false    1
0           brand:adidas          0,1
1           brand:nike            2
1           price:230             1
2           price:250             0

Doc Value (forward index), term ords by docId
field           0    1    2
price           2    1    3
brand           0    0    1
availability    0    1    0

● Lucene Index = Multiple Mini Indexes aka Segments

● Lucene Segment

○ Write Once → Immutable Data structures

○ Posting List (Sparse encoded bitsets)

○ Doc Values (Columnar Data structures)
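The doc-values table stores term ordinals rather than raw values. A minimal sketch of that encoding follows; `ordinal_encode` is a hypothetical helper, and its ordinals start at 0 in sorted order, so the numbering differs from the slide even though the idea is the same.

```python
def ordinal_encode(column):
    """Return (terms, ords): sorted unique values and one ordinal per document."""
    terms = sorted(set(column))
    ord_of = {term: i for i, term in enumerate(terms)}
    return terms, [ord_of[value] for value in column]


# Columns are listed in docId order: ProductA, ProductB, ProductC.
brand_terms, brand_ords = ordinal_encode(["adidas", "adidas", "nike"])
price_terms, price_ords = ordinal_encode([250, 230, 500])

# Decoding is one array read per document.
price_of_doc_0 = price_terms[price_ords[0]]
```

Because the ordinals depend on the full set of terms in the segment, adding a new value can shift existing ordinals, which is one more reason these structures are written once and never edited in place.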


C5 : Lucene in-place update

● Only numeric / byte Array fields

● Updates to go through the entire refresh cycle

● Not exposed via Solr

Forward Index - API Hook

● Lucene API Hook

– ValueSource

● Input

– Lucene Internal Document Id

– Field Name

● Output

– Field Value

NRT Store - Inverted Index

● Input

– Lucene Segment

– query

• Field Name : Field Value

• offer : o1

● Output

– DocSet (posting list)
