Product Matching at Scale
Nikhil Ketkar
Data Science @ Indix
16 July 2015
Indix
Problem Definition
Crawler
Matching
Product Pages
Groups of Matching URLs
Focus of the Talk
Business Impact
Competitive Landscape
Who are your competitors?
How are they pricing products?
What other products do they carry?
Scale: Products, Sites, Categories
Store
Product
Match
Matching is Central to Answering Key Questions in Retail Analytics
Sub-Problem: Parsing
1. Title
2. Image URL
3. Price
4. Description
5. Tables
Challenges: Scale, Depth, Diversity, Change
DOM Tree
Title or Not: Binary Classification
Class Imbalance
Parsing: Feature Engineering is Key
DOM Tree
HTML Features
Visual Features
Random Forest Model
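The HTML-feature step can be illustrated with a minimal sketch. It assumes each text-bearing DOM node becomes one row of features (tag, depth, word count, heading flag, CSS hints) that a title-or-not classifier such as the Random Forest could consume. The class name, feature names, and sample markup are hypothetical and not taken from the talk. Visual features would need a rendering engine and are not covered here.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not deepen the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class NodeFeatureExtractor(HTMLParser):
    """Collects one feature dict per non-empty text node (hypothetical features)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open (tag, attrs) pairs from the root to the current node
        self.rows = []   # one feature dict per text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerates sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        tag, attrs = self.stack[-1]
        css = (attrs.get("class") or "") + " " + (attrs.get("id") or "")
        self.rows.append({
            "text": text,
            "tag": tag,
            "depth": len(self.stack),
            "n_words": len(text.split()),
            "is_heading": tag in {"h1", "h2", "h3"},
            "css_mentions_title": "title" in css.lower(),
        })


def extract_features(html):
    """Return the feature rows for every text node in an HTML string."""
    parser = NodeFeatureExtractor()
    parser.feed(html)
    return parser.rows


SAMPLE_PAGE = (
    '<html><body><div id="nav"><a href="/">Home</a></div>'
    '<h1 class="product-title">Nike Free 5.0 Running Shoe</h1>'
    '<span class="price">$89.99</span></body></html>'
)

for row in extract_features(SAMPLE_PAGE):
    print(row)
```

Only one node per page is the title, so rows like these are heavily imbalanced toward the negative class, which is the class-imbalance issue noted on the slide.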
Sub-Problem: Classification
1. Title
2. Image URL
3. Price
4. Description
5. Tables
Category
Taxonomy
Challenges: Large Taxonomy, Lack of Training Data, Changes in Taxonomy
Classification: Using Ensembles is Key
Linear SVM
CNN
Ensemble
Breadcrumb Mapping
Background Knowledge
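One simple way to combine such classifiers is a confidence-weighted vote. The sketch below assumes each component (linear SVM, CNN, breadcrumb mapping) returns a category and a confidence, and that a component may abstain. The talk does not say how the ensemble is actually combined, so the function name, weights, categories, and scores are all hypothetical.

```python
from collections import defaultdict


def ensemble_vote(predictions, weights):
    """Combine per-model predictions by confidence-weighted voting.

    predictions: dict model_name -> (category, confidence in [0, 1]) or None
    weights:     dict model_name -> non-negative weight (defaults to 1.0)
    """
    scores = defaultdict(float)
    for model, pred in predictions.items():
        if pred is None:  # a model may abstain, e.g. no breadcrumb on the page
            continue
        category, confidence = pred
        scores[category] += weights.get(model, 1.0) * confidence
    if not scores:
        return None
    # Iterate in name order so that ties resolve deterministically.
    return max(sorted(scores), key=lambda c: scores[c])


# Hypothetical outputs for one product page.
predictions = {
    "linear_svm": ("Shoes > Running", 0.6),
    "cnn": ("Shoes > Training", 0.7),
    "breadcrumb_mapping": ("Shoes > Running", 0.9),
}
weights = {"linear_svm": 1.0, "cnn": 1.0, "breadcrumb_mapping": 1.5}

print(ensemble_vote(predictions, weights))
```

A weighted vote lets a high-precision but sparse signal such as a breadcrumb mapping dominate when it is present, while the learned models cover pages where it is missing.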
Sub-Problem: Attribute Extraction
1. Title
2. Image URL
3. Price
4. Description
5. Tables
Challenges: Large number of attributes, bad/missing data, variability
1. Brand
2. Size
3. Color
4. Packs
5. …
Schema
Training Shoes with Different Brands
Brand:Nike
Brand:Reebok
Attribute Extraction using CRFs
Brand:Nike
Color:Black/Neo Lime Total-Crimson
Sole:Rubber
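A CRF tagger assigns one label per token; the labels then have to be grouped back into attribute values like the ones above. The sketch below shows only that decoding step, assuming BIO-style labels (`B-` begins a value, `I-` continues it, `O` is outside). The CRF itself is not shown, and the tokenised title and its labels are hypothetical.

```python
def labels_to_attributes(tokens, labels):
    """Group BIO-tagged tokens into an {attribute: value} dict."""
    spans = []        # each entry: (attribute name, list of tokens)
    open_name = None  # attribute of the span still being extended, if any
    for token, label in zip(tokens, labels):
        prefix, _, name = label.partition("-")
        if prefix == "B":
            spans.append((name, [token]))
            open_name = name
        elif prefix == "I" and name == open_name:
            spans[-1][1].append(token)
        else:
            open_name = None  # "O" or a stray "I-" tag closes the current span
    attributes = {}
    for name, span_tokens in spans:
        # Keep the first value if the tagger emits the same attribute twice.
        attributes.setdefault(name, " ".join(span_tokens))
    return attributes


# Hypothetical tagger output for one product title.
tokens = ["Nike", "Training", "Shoe", "Black/Neo", "Lime", "Total-Crimson", "Rubber", "Sole"]
labels = ["B-Brand", "O", "O", "B-Color", "I-Color", "I-Color", "B-Sole", "O"]

print(labels_to_attributes(tokens, labels))
```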
Sub-Problem: Blocking
1. Title
2. Image URL
3. Price
4. Description
5. Tables
Challenges: No single approach works well
1. Brand
2. Size
3. Color
4. Packs
5. …
Category
Enriched Product Record
Blocking is Critical and Needs Multiple Approaches
Merge Groups
Bucketing/Clustering
Mass Join
LSH
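As an illustration of the LSH approach, here is a minimal MinHash-with-banding sketch over title tokens. Records that share a bucket in any band become candidate pairs for the later match-inference step. The function names, the number of hashes and bands, and the sample titles are assumptions; the talk does not describe the LSH scheme actually used.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=32):
    """One minimum per seeded hash; equal token sets give equal signatures."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def lsh_candidate_pairs(records, num_hashes=32, bands=8):
    """records: dict id -> title. Returns a set of candidate (id, id) pairs."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rec_id, title in records.items():
        tokens = set(title.lower().split())
        if not tokens:
            continue
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            # Records agreeing on every row of a band fall into the same bucket.
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical titles from three stores.
records = {
    "a": "Nike Free 5.0 Running Shoe Black",
    "b": "nike free 5.0 running shoe black",
    "c": "KitchenAid Stand Mixer Red",
}

print(lsh_candidate_pairs(records))
```

More bands with fewer rows each raise recall at the cost of more candidate pairs, which is one reason a single blocking method rarely suffices and the resulting groups are merged.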
Sub-Problem: Match Inference
Challenges: Pairwise Distance Computation, Match at a Store Constraint
Store
Product
Match
1. Pairwise Distance Computation
2. Constrained Clustering
Distance Computation
1. Title BOW
2. Brand
3. Category
4. Attributes
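The four signals listed above can be combined into a single pairwise distance. The sketch below assumes cosine distance over title bag-of-words counts, exact-match checks for brand and category, and a mismatch rate over shared attributes, mixed with fixed weights. The weights, the neutral value for missing attributes, and the record layout are hypothetical.

```python
import math
from collections import Counter


def bow_cosine_distance(text_a, text_b):
    """1 - cosine similarity between bag-of-words count vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def product_distance(p, q, weights=None):
    """Weighted mix of title, brand, category and attribute distances in [0, 1]."""
    weights = weights or {"title": 0.4, "brand": 0.2, "category": 0.2, "attributes": 0.2}
    title_d = bow_cosine_distance(p["title"], q["title"])
    brand_d = 0.0 if p["brand"].lower() == q["brand"].lower() else 1.0
    category_d = 0.0 if p["category"] == q["category"] else 1.0
    # Compare only attributes present on both records; missing data is common.
    shared = set(p["attributes"]) & set(q["attributes"])
    if shared:
        attr_d = sum(p["attributes"][k] != q["attributes"][k] for k in shared) / len(shared)
    else:
        attr_d = 0.5  # assumed neutral value when there is no evidence either way
    return (weights["title"] * title_d + weights["brand"] * brand_d
            + weights["category"] * category_d + weights["attributes"] * attr_d)


# Hypothetical enriched product records.
p = {"title": "Nike Free 5.0 Running Shoe", "brand": "Nike",
     "category": "Shoes", "attributes": {"Color": "Black", "Size": "10"}}
q = {"title": "Reebok Classic Leather Sneaker", "brand": "Reebok",
     "category": "Apparel", "attributes": {"Color": "White"}}

print(product_distance(p, p), product_distance(p, q))
```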
Some Constraints cannot be Violated
Constraint Type    Examples
Must Link          UPC, MPN
Cannot Link        Match at a Store
Must Link
Cannot Link
May Link
Use Constrained Clustering
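A minimal sketch of constrained clustering under these rules: greedy agglomerative merging in which a shared UPC forces a merge (must-link) and two records from the same store are never placed together (cannot-link). The complete-linkage choice, the threshold, the rule that cannot-link wins over must-link on conflict, and the toy records are all assumptions rather than the method used in the talk.

```python
from itertools import combinations


def constrained_clusters(records, distance, threshold):
    """records: list of dicts with 'id', 'store' and optional 'upc'.
    distance: callable(record, record) -> float.
    Returns a list of clusters, each a sorted list of record ids.
    """
    clusters = [[r] for r in records]

    def can_merge(a, b):
        # Cannot-link: two records from the same store never share a cluster.
        return not ({r["store"] for r in a} & {r["store"] for r in b})

    def must_merge(a, b):
        # Must-link: any shared UPC forces the merge.
        upcs_a = {r["upc"] for r in a if r.get("upc")}
        return any(r.get("upc") in upcs_a for r in b)

    def linkage(a, b):
        # Complete linkage: the farthest pair decides the cluster distance.
        return max(distance(x, y) for x in a for y in b)

    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            a, b = clusters[i], clusters[j]
            if not can_merge(a, b):
                continue
            d = -1.0 if must_merge(a, b) else linkage(a, b)
            if d <= threshold and (best is None or d < best[0]):
                best = (d, i, j)
        if best is None:
            break
        _, i, j = best  # i < j, so deleting j leaves index i valid
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(r["id"] for r in c) for c in clusters]


# Toy records: 'x' stands in for a real feature vector.
records = [
    {"id": "a1", "store": "A", "upc": "123", "x": 0.0},
    {"id": "b1", "store": "B", "upc": "123", "x": 10.0},
    {"id": "a2", "store": "A", "upc": None, "x": 0.1},
    {"id": "c1", "store": "C", "upc": None, "x": 100.0},
]

print(constrained_clusters(records, lambda p, q: abs(p["x"] - q["x"]), threshold=1.0))
```

Here `a1` and `b1` merge despite being far apart because they share a UPC, while `a1` and `a2` stay separate despite being close because they come from the same store.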
Overall Process
HTML -> Parsing -> Product Record -> Classification -> Classified Products -> Attribute Extraction -> Attributes -> Blocking -> Product Groups -> Match Inference -> Matches
Evaluation
[Figure: overlapping sets of Reported and Actual matches; the overlap is Correct]
Precision: Sample and spot-check
Recall: Hard to estimate; rare population; manually search products on a site to produce blind sets
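Both estimates reduce to simple proportions. The sketch below assumes precision is estimated from a random sample of reported matches that reviewers label by hand, with a Wilson score interval to show the sampling uncertainty, and that recall is measured against a manually built blind set. The function names and all counts are hypothetical.

```python
import math


def wilson_interval(correct, sampled, z=1.96):
    """Wilson score interval for a proportion; z=1.96 is roughly 95% confidence."""
    if sampled == 0:
        raise ValueError("need at least one sampled match")
    p = correct / sampled
    denom = 1 + z * z / sampled
    centre = (p + z * z / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z * z / (4 * sampled ** 2)) / denom
    return centre - half, centre + half


def blind_set_recall(reported_pairs, blind_set_pairs):
    """Fraction of manually found true matches that the system also reported."""
    if not blind_set_pairs:
        raise ValueError("blind set is empty")
    return len(set(reported_pairs) & set(blind_set_pairs)) / len(set(blind_set_pairs))


# Hypothetical spot-check: 186 of 200 sampled matches judged correct.
low, high = wilson_interval(correct=186, sampled=200)
print(f"precision estimate {186 / 200:.3f}, interval ({low:.3f}, {high:.3f})")

# Hypothetical blind set of four true matches, three of which were reported.
blind = {("u1", "u2"), ("u3", "u4"), ("u5", "u6"), ("u7", "u8")}
reported = {("u1", "u2"), ("u3", "u4"), ("u5", "u6"), ("u9", "u10")}
print(f"recall on blind set {blind_set_recall(reported, blind):.2f}")
```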
Lack of Ground Truth is the Biggest Roadblock
Tools Matter