Anatomy of an E-commerce Search Engine, by Mayur Datar
TRANSCRIPT
● Search is one of the most important discovery tools in E-commerce.
● Powers other features like merchandising (promotions), recommendations etc.
● Accounts for a big fraction of the units sold and GMV (gross merchandise value).
● Important signals that affect search: Price, offers, popularity, availability, serviceability etc.
○ Used in ranking of products.
○ Exposed as filters and sorts to end users.
● These signals are very dynamic, particularly during sales.
● E-commerce search != web search.
● Documents have a structure to them
● Queries have an implicit structure
● Challenges:
○ Large document collection with a long heavy tail
○ Extremely high rate of changes/updates (thousands per sec)
○ Geo-specific ranking
○ Multi-objective optimization (GMV, Units, Ads revenue, Long Term Value)
● Opportunities:
○ Broad queries: personalization can play a huge role
● Queries per day: XXX Millions / week
● Latencies:
○ Average: ~100 ms
○ Median: ~50 ms
○ 90th percentile: ~500 ms
● Documents retrieved and scored from index:
○ Median: 1K to 10K
○ 95th percentile: 200K to 500K
○ 99th percentile: 500K to 3M+
● Search CTR: Around 50%
● Architectural overview of the search platform
○ Serving and Ingestion
○ Serving functional view
○ Serving architectural view
○ Ingestion architectural view
○ Example ingestion topology
● Search quality
○ Challenges
○ Life of a query: Typical flow for query understanding
○ Illustrative problems
● 1,000,000 Compute Cores
● 2.56 Petabytes RAM
● 120 Petabytes Disk Storage
● 1 Petabyte NVMe SSD
● 128 Tbps bisection bandwidth Clos network
● Query Rewriter (Spell Check, Concept, NLP, Intent, Augmentation, Retrieval/Scoring query formulation)
● Reverse Proxy (Geo Coding, User Context, Caching, Isolation, Rate Limit, Tee-off test framework)
● Search Broker (Distributed Search across shards, Blending of Results from shards)
● Searcher (Matching, Scoring, Faceting, Top-K Retrieval (pass-1 ranking))
○ Text index
○ NRT index
○ Metadata
● Re-ranking (Pass-2 Ranking): ML Model
● Pluggable Ranking Models
● Pluggable Rewriter Modules
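The broker and two-pass ranking steps listed above can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: each shard is modeled as a hypothetical dict of product id to pass-1 score, and the names `search_shard`, `broker_search`, `rerank` and the boost-based model are assumptions made up for this example.

```python
import heapq
import itertools
from typing import Callable, Dict, List, Tuple

Hit = Tuple[float, str]  # (pass-1 score, product id)


def search_shard(shard: Dict[str, float], k: int) -> List[Hit]:
    """Searcher: return this shard's top-k hits by pass-1 score, best first."""
    return heapq.nlargest(k, ((score, pid) for pid, score in shard.items()))


def broker_search(shards: List[Dict[str, float]], k: int) -> List[Hit]:
    """Search Broker: query every shard, then blend the per-shard top-k lists."""
    per_shard = [search_shard(shard, k) for shard in shards]
    # Every per-shard list is sorted best-first, so a k-way merge with
    # reverse=True yields a globally sorted stream; keep only the first k.
    merged = heapq.merge(*per_shard, reverse=True)
    return list(itertools.islice(merged, k))


def rerank(hits: List[Hit], model: Callable[[str, float], float]) -> List[str]:
    """Pass-2: re-score only the blended top-k with a pluggable model."""
    rescored = sorted(hits, key=lambda hit: model(hit[1], hit[0]), reverse=True)
    return [pid for _, pid in rescored]


if __name__ == "__main__":
    # Toy shards with made-up scores (illustrative only).
    shards = [{"p1": 0.9, "p2": 0.4}, {"p3": 0.7, "p4": 0.1}]
    top = broker_search(shards, k=3)
    # Hypothetical pass-2 model: pass-1 score plus a boost for one product.
    print(rerank(top, lambda pid, score: score + (1.0 if pid == "p2" else 0.0)))
```

The point of the split is cost: pass-1 scoring runs over every matched document on each shard, while the more expensive pass-2 model only sees the k blended survivors.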
Serving: Arch View
● Marketplace
○ Catalog entries vary in quality from seller to seller. Spam is rampant.
● Diversity of users
● Mobile-heavy users: Real estate on UI
● Poor internet connectivity
● Literacy/Internet awareness
● Language
● Economic power
● Regional preferences
Abstraction: City-tier
Query/Intent Solicitation
Result Presentation
Product Ranking
40% increase in proportion of tier-3 customers vis-a-vis metro
Query: samsang
Relative ratio of query Tier-3 Vs Metro: 1.8
Query: jins
Relative ratio of query Tier-3 Vs Metro: 2.2
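The slides do not define "relative ratio". One plausible reading, assumed here, is the query's share of tier-3 traffic divided by its share of metro traffic. The sketch below computes that; the function name `relative_ratio` and the toy query logs are illustrative, not data from the talk.

```python
from collections import Counter
from typing import List


def relative_ratio(query: str, tier3_queries: List[str], metro_queries: List[str]) -> float:
    """Share of `query` in tier-3 traffic divided by its share in metro traffic."""
    tier3_share = Counter(tier3_queries)[query] / len(tier3_queries)
    metro_share = Counter(metro_queries)[query] / len(metro_queries)
    if metro_share == 0:
        raise ValueError("query never seen in metro traffic; ratio is undefined")
    return tier3_share / metro_share


if __name__ == "__main__":
    # Made-up logs: the misspelling is 9% of tier-3 queries and 5% of metro queries.
    tier3 = ["samsang"] * 9 + ["samsung"] * 91
    metro = ["samsang"] * 5 + ["samsung"] * 95
    print(round(relative_ratio("samsang", tier3, metro), 2))
```

A ratio above 1 means the query is over-represented in tier-3 traffic, which is the pattern the "samsang" and "jins" examples point at.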
Query Scoring
● Normalisation (index time as well)
○ String clean-up
○ Lowercasing
● Spell Correction
○ Resource-based: term -> term, query -> query
○ Online
● Init Context
● Phrasing (index time as well)
○ Frequent bi/tri-grams
● Stemming (index time as well)
○ Core e-commerce stemmer
○ Plurals
● Common MetaData Store (query level)
○ Raw data: metrics (CTR, Impression, NDCG…)
○ Derived data: Store, LM score, Features
● Synonyms
○ Resource-based
● Intent
○ Deductions
○ Tagging (CRF)
● Query Rewrite
○ Best query selection
○ Partial match
● SOLR interface
● Query Understanding Output Generator
● FDP
● Retrieval ranking logic
● Store Classifier
● Query LM
● Feature Store
● Classification
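Three of the stages above (normalisation, phrasing, plural stemming) can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: the bigram set stands in for mined "frequent bi/tri-grams", the plural rule is deliberately naive compared with a real e-commerce stemmer, and all function names are hypothetical. Spell correction is left out here.

```python
import re
from typing import List, Set, Tuple


def normalise(query: str) -> List[str]:
    """String clean-up and lowercasing: keep letters/digits, split on the rest."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", query.lower())
    return cleaned.split()


def phrase(tokens: List[str], bigrams: Set[Tuple[str, str]]) -> List[str]:
    """Join known frequent bigrams into a single token, scanning left to right."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def stem_plural(token: str) -> str:
    """Naive plural stripping; phrases, short tokens and '-ss' words are left alone."""
    if "_" in token or len(token) <= 3 or token.endswith("ss"):
        return token
    if token.endswith("ies"):
        return token[:-3] + "y"
    if token.endswith("s"):
        return token[:-1]
    return token


def understand(query: str, bigrams: Set[Tuple[str, str]]) -> List[str]:
    """Run normalisation, then phrasing, then plural stemming."""
    return [stem_plural(t) for t in phrase(normalise(query), bigrams)]


if __name__ == "__main__":
    print(understand("  Red-Tape SHOES!! ", {("red", "tape")}))
```

Because the slides mark these stages as "index time as well", the same functions would need to run over product text when indexing, so that query tokens and document tokens agree.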
• Special patterns:
– Segmented words: lgnexus5
• Counting: “samsang” & no-click followed by “samsung” & click a million times
– Context aware counting
• Language modeling and edit distance
• Term to vector models in deep learning.
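The counting idea above (a query with no click, followed by a near-identical query with a click) can be sketched as follows. This is an illustrative sketch, not the speaker's implementation: the session format, the thresholds `min_count` and `max_dist`, and the function names are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Event = Tuple[str, bool]  # (query, whether any result was clicked)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def mine_corrections(sessions: List[List[Event]],
                     min_count: int = 2,
                     max_dist: int = 2) -> Dict[str, str]:
    """Count (no-click query -> next clicked query) pairs that are close in edit distance."""
    counts: Counter = Counter()
    for session in sessions:
        for (q1, clicked1), (q2, clicked2) in zip(session, session[1:]):
            if not clicked1 and clicked2 and q1 != q2 and edit_distance(q1, q2) <= max_dist:
                counts[(q1, q2)] += 1
    corrections: Dict[str, str] = {}
    # most_common() goes from highest count down, so the first rewrite wins.
    for (q1, q2), n in counts.most_common():
        if n >= min_count and q1 not in corrections:
            corrections[q1] = q2
    return corrections


if __name__ == "__main__":
    # Made-up sessions for illustration.
    sessions = [[("samsang", False), ("samsung", True)]] * 3
    sessions.append([("samsang", False), ("samsing", True)])
    print(mine_corrections(sessions))
```

The output of such mining would feed the resource-based query -> query table; the edit-distance filter keeps genuine reformulations (a different product altogether) out of the spelling resource.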
Specific
General
● Intent: From query tokens to (implicit) attributes that are represented by those tokens
● Examples:
○ “red tape shoes” -> (brand) “red tape” (store) “shoes”
○ “kids party dress 4-5 years pack of 2” -> (ideal_for) “kids” (occasion) “party” (store) “dress” (size) “4-5 years” (pack_of) “pack of 2”
○ “samsung e6 cases” -> (compatible_with) “samsung e6” (store) “cases”
● Memorization, Language modeling, CRF
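The "memorization" approach can be illustrated with a greedy longest-match lookup against a phrase lexicon. This is only a sketch: the lexicon entries are taken from the examples above, while the function name `tag_query`, the `max_len` limit and the "unknown" label are assumptions. A CRF tagger, also named on the slide, would instead use context to resolve ambiguous tokens.

```python
from typing import Dict, List, Tuple


def tag_query(query: str, lexicon: Dict[str, str], max_len: int = 3) -> List[Tuple[str, str]]:
    """Tag query spans with attributes by greedy longest-match dictionary lookup."""
    tokens = query.lower().split()
    tags: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest span starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in lexicon:
                tags.append((lexicon[span], span))
                i += n
                break
        else:
            # No lexicon entry covers this token.
            tags.append(("unknown", tokens[i]))
            i += 1
    return tags


if __name__ == "__main__":
    lexicon = {
        "red tape": "brand", "shoes": "store",
        "kids": "ideal_for", "party": "occasion", "dress": "store",
        "4-5 years": "size", "pack of 2": "pack_of",
    }
    print(tag_query("red tape shoes", lexicon))
    print(tag_query("kids party dress 4-5 years pack of 2", lexicon))
```

Longest-match is what keeps "red tape" from being read as a colour plus a product; it fails when the right segmentation depends on context, which is where language modeling and CRF tagging come in.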
Price Personalization
● Users’ past activity on the platform: past orders, product views
● 5 price ranges (1 to 5) defined for each vertical, e.g. shoes, watches
● User-segments based on price affinities (economical to expensive)
● Customised search ranking for each user-segment
● Figure axes: price range (1 to 5), # of users
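A rough sketch of the price-segmentation idea follows. The slides only say that five price ranges are defined per vertical and that users are segmented by price affinity; cutting the ranges at catalog price quintiles and picking a user's most frequent range are assumptions made here, as are the function names.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles
from typing import List


def price_boundaries(catalog_prices: List[float], n_ranges: int = 5) -> List[float]:
    """Cut points splitting one vertical's prices into n_ranges quantile ranges."""
    return quantiles(catalog_prices, n=n_ranges)  # returns n_ranges - 1 cut points


def price_range(price: float, boundaries: List[float]) -> int:
    """Map a price to a range index from 1 (cheapest) to len(boundaries) + 1."""
    return bisect_right(boundaries, price) + 1


def user_segment(activity_prices: List[float], boundaries: List[float]) -> int:
    """Segment = the price range the user ordered or viewed most often in this vertical."""
    ranges = [price_range(p, boundaries) for p in activity_prices]
    return Counter(ranges).most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up shoe prices and user activity, for illustration only.
    shoe_prices = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
    bounds = price_boundaries(shoe_prices)
    print(bounds)
    print(user_segment([120, 180, 500], bounds))
```

A ranker could then take the segment as a feature, or one ranking configuration could be kept per segment, which matches the "customised search ranking for each user-segment" step.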