indix5thelephantdraft2
![Page 1: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/1.jpg)
Product Matching at Scale
Nikhil Ketkar
Data Science @ Indix
16 July 2015
![Page 2: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/2.jpg)
Indix
![Page 3: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/3.jpg)
Indix
![Page 4: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/4.jpg)
Indix
![Page 5: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/5.jpg)
Problem Definition
Crawler
Product Pages
Matching (Focus of the Talk)
Groups of Matching URLs
![Page 6: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/6.jpg)
Business Impact
Competitive Landscape:
Who are your competitors?
How are they pricing products?
What other products do they carry?
Scale: Products, Sites, Categories
Store
Product
Match
Matching is central to answering key questions in retail analytics
![Page 7: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/7.jpg)
Sub-Problem: Parsing
1. Title
2. Image URL
3. Price
4. Description
5. Tables
Challenges: Scale, Depth, Diversity, Change
DOM Tree
Title or Not: Binary Classification
Class Imbalance
![Page 8: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/8.jpg)
Parsing: Feature Engineering is Key
DOM Tree
HTML Features
Visual Features
Random Forest Model
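To make the feature-engineering step concrete, here is a minimal sketch of turning DOM nodes into HTML feature rows for the "title or not" classifier. The feature names, the sample page, and the `extract_node_features` helper are illustrative assumptions, not the features Indix used, and visual features (which need a rendered page) are left out. Rows like these could then be fed to a random forest, for example scikit-learn's `RandomForestClassifier` with `class_weight="balanced"` to counter the class imbalance.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class NodeFeatureExtractor(HTMLParser):
    """Collects one feature dict per text-bearing DOM node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open tags from the root down to the current node
        self.rows = []   # one feature dict per non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        tag, attrs = self.stack[-1]
        self.rows.append({
            "text": text,
            "tag": tag,
            "depth": len(self.stack),
            "n_words": len(text.split()),
            "is_heading": tag in {"h1", "h2", "h3"},
            "class_mentions_title": "title" in (attrs.get("class") or "").lower(),
        })


def extract_node_features(html):
    """Hypothetical helper: parse a page and return its feature rows."""
    parser = NodeFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up product page used only for illustration.
page = """
<html><body>
  <div class="nav"><a href="/">Home</a><a href="/shoes">Shoes</a></div>
  <h1 class="product-title">Nike Free Trainer 5.0</h1>
  <span class="price">$99.99</span>
  <p>Lightweight training shoe.</p>
</body></html>
"""
rows = extract_node_features(page)
for row in rows:
    print(row["tag"], row["depth"], row["is_heading"], row["text"])
```

Only one of the five text nodes on this toy page is the title, which is the class imbalance the previous slide mentions; real pages have far more candidate nodes per title.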
![Page 9: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/9.jpg)
Sub-Problem: Classification
1. Title
2. Image URL
3. Price
4. Description
5. Tables
Category
Taxonomy
Challenges: Large Taxonomy, Lack of Training Data, Changes in Taxonomy
![Page 10: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/10.jpg)
Classification: Using Ensembles is Key
Linear SVM
CNN
Ensemble
Breadcrumb Mapping
Background Knowledge
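One simple way to combine these sources is a weighted vote in which any source may abstain. The sketch below assumes that scheme; the slide does not say how the ensemble is actually built, and the weights, category labels, and `BREADCRUMB_MAP` contents are hypothetical. The SVM and CNN predictions are passed in as plain strings standing in for real model outputs.

```python
from collections import Counter

# Hypothetical background knowledge: site breadcrumbs mapped to taxonomy nodes.
BREADCRUMB_MAP = {
    ("shoes", "training"): "Footwear > Athletic > Training Shoes",
}


def breadcrumb_vote(breadcrumb):
    """Return the mapped category, or None (abstain) if the breadcrumb is unknown."""
    key = tuple(part.strip().lower() for part in breadcrumb)
    return BREADCRUMB_MAP.get(key)


def ensemble_predict(votes, weights):
    """Weighted vote over per-source predictions; a None vote abstains."""
    tally = Counter()
    for source, label in votes.items():
        if label is not None:
            tally[label] += weights.get(source, 1.0)
    if not tally:
        return None
    # Iterate labels in sorted order so ties resolve deterministically.
    return max(sorted(tally), key=lambda label: tally[label])


votes = {
    "linear_svm": "Footwear > Athletic > Running Shoes",   # stand-in model output
    "cnn": "Footwear > Athletic > Training Shoes",          # stand-in model output
    "breadcrumb": breadcrumb_vote(["Shoes", "Training"]),
}
weights = {"linear_svm": 1.0, "cnn": 1.0, "breadcrumb": 1.5}
category = ensemble_predict(votes, weights)
print(category)
```

Letting the breadcrumb source abstain matters because many sites have no usable breadcrumb, in which case the vote falls back to the two learned models.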
![Page 11: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/11.jpg)
Sub-Problem: Attribute Extraction
1. Title
2. Image URL
3. Price
4. Description
5. Tables
Challenges: Large number of attributes, bad/missing data, variability
1. Brand
2. Size
3. Color
4. Packs
5. …
Schema
![Page 12: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/12.jpg)
Training Shoes with Different Brands
Brand: Nike
Brand: Reebok
![Page 13: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/13.jpg)
Attribute Extraction using CRFs
Brand: Nike
Color: Black/Neo Lime Total-Crimson
Sole: Rubber
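A CRF treats attribute extraction as sequence labelling: each token gets a feature dict, the model assigns a tag per token, and tagged spans become attribute values. The sketch below shows only the two ends of that process, assuming a BIO tagging scheme. The feature set, the token sequence, and the tags are hand-written illustrations rather than the output of a trained model; a CRF library such as sklearn-crfsuite would sit between the two functions.

```python
def token_features(tokens, i):
    """Features for token i, in the dict-per-token form CRF libraries commonly accept."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title_case": tok.istitle(),
        "has_digit": any(ch.isdigit() for ch in tok),
        "is_first": i == 0,
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }


def decode_bio(tokens, tags):
    """Turn a BIO tag sequence into an attribute dict, e.g. B-Brand -> {"Brand": ...}."""
    attrs = {}
    current, span = None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                attrs[current] = " ".join(span)
            current, span = tag[2:], [tok]
        elif tag.startswith("I-") and current == tag[2:]:
            span.append(tok)
        else:
            # "O" (or a stray I- tag) closes whatever span was open.
            if current:
                attrs[current] = " ".join(span)
            current, span = None, []
    if current:
        attrs[current] = " ".join(span)
    return attrs


# Hypothetical title tokens with hand-assigned tags standing in for CRF output.
tokens = ["Nike", "Training", "Shoe", "Black/Neo", "Lime", "Rubber", "Sole"]
tags = ["B-Brand", "O", "O", "B-Color", "I-Color", "B-Sole", "O"]

features = [token_features(tokens, i) for i in range(len(tokens))]
attributes = decode_bio(tokens, tags)
print(attributes)
```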
![Page 14: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/14.jpg)
Sub-Problem: Blocking
1. Title
2. Image URL
3. Price
4. Description
5. Tables
Challenges: No single approach works well
1. Brand
2. Size
3. Color
4. Packs
5. …
Category
Enriched Product Record
![Page 15: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/15.jpg)
Blocking is Critical and Needs Multiple Approaches
Merge Groups
Bucketing/Clustering
Mass Join
LSH
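Of the blocking approaches listed, LSH is the easiest to show in a few lines. The sketch below is a generic MinHash LSH over character shingles of the title, offered as one possible reading of "LSH" on this slide rather than Indix's implementation. The shingle size, the band and row counts, and the record ids are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Character k-grams of the lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def minhash_signature(items, num_hashes):
    """One minimum per seeded hash function; MD5 keeps results deterministic."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return signature


def lsh_candidate_pairs(records, bands=8, rows=2):
    """Records sharing at least one identical signature band become candidates."""
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), bands * rows)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records: id -> product title.
records = {
    "store1/p1": "Nike Free Trainer 5.0 Black",
    "store2/p9": "nike free trainer 5.0  black",
    "store3/p4": "KitchenAid Stand Mixer 5 Qt",
}
pairs = lsh_candidate_pairs(records)
print(sorted(pairs))
```

Only the candidate pairs go on to the expensive pairwise comparison. More bands with fewer rows each raises recall at the cost of more candidates, which is one reason a single blocking scheme rarely suffices.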
![Page 16: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/16.jpg)
Sub-Problem: Match Inference
Challenges: Pairwise Distance Computation, Match at a Store Constraint
Store
Product
Match
1. Pairwise Distance Computation
2. Constrained Clustering
![Page 17: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/17.jpg)
Distance Computation
1. Title BOW
2. Brand
3. Category
4. Attributes
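One way to combine the four signals above is a weighted sum of per-field distances. The sketch below assumes that form; the slide does not give the actual combination. It uses Jaccard distance over the set of title words, exact-match checks for brand and category, and a mismatch rate over shared attributes. The weights and the two sample records are hypothetical.

```python
# Hypothetical weights; in practice these would be tuned or learned.
WEIGHTS = {"title": 0.4, "brand": 0.3, "category": 0.1, "attributes": 0.2}


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def mismatch(x, y):
    """0.0 when the values agree (case-insensitively), else 1.0."""
    return 0.0 if str(x).lower() == str(y).lower() else 1.0


def attribute_distance(a, b):
    """Share of attributes present in both records whose values disagree."""
    shared = set(a) & set(b)
    if not shared:
        return 1.0  # nothing comparable, so treat as maximally distant
    return sum(mismatch(a[k], b[k]) for k in shared) / len(shared)


def product_distance(p, q, weights=WEIGHTS):
    parts = {
        "title": jaccard_distance(p["title"].lower().split(), q["title"].lower().split()),
        "brand": mismatch(p["brand"], q["brand"]),
        "category": mismatch(p["category"], q["category"]),
        "attributes": attribute_distance(p["attributes"], q["attributes"]),
    }
    return sum(weights[k] * parts[k] for k in weights)


p = {"title": "Nike Free Trainer 5.0", "brand": "Nike",
     "category": "Training Shoes", "attributes": {"Color": "Black"}}
q = {"title": "Nike Free Trainer 5.0 Mens", "brand": "nike",
     "category": "Training Shoes", "attributes": {"Color": "Black", "Size": "10"}}

print(round(product_distance(p, q), 3))
```

Here only the title differs (four of five words shared), so the weighted distance stays small. Using the word set rather than word counts is a simplification of a true bag-of-words comparison.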
![Page 18: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/18.jpg)
Some Constraints cannot be Violated
| Constraint Type | Must Link | Cannot Link |
| --- | --- | --- |
| Examples | UPC, MPN | Match at a Store |

Must Link
Cannot Link
May Link
Use Constrained Clustering
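A minimal sketch of constrained clustering under these rules, assuming a greedy union-find scheme: must-link pairs (for example, equal UPC) are merged first, then remaining candidate pairs are merged in order of increasing distance unless the merge would put two products from the same store in one group. The algorithm choice, threshold, and sample data are assumptions, and the must-link input is assumed not to contradict the store constraint.

```python
class ConstrainedClusters:
    """Union-find that tracks which stores each cluster already covers."""

    def __init__(self, store_of):
        self.parent = {p: p for p in store_of}
        self.stores = {p: {s} for p, s in store_of.items()}  # keyed by cluster root

    def find(self, p):
        while self.parent[p] != p:
            self.parent[p] = self.parent[self.parent[p]]  # path halving
            p = self.parent[p]
        return p

    def violates_cannot_link(self, a, b):
        """True if merging would place two products from one store together."""
        ra, rb = self.find(a), self.find(b)
        return ra != rb and bool(self.stores[ra] & self.stores[rb])

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        self.parent[rb] = ra
        self.stores[ra] |= self.stores.pop(rb)


def cluster(store_of, must_link, scored_pairs, threshold):
    cc = ConstrainedClusters(store_of)
    # Must-link pairs are merged unconditionally (assumed consistent).
    for a, b in must_link:
        cc.union(a, b)
    # May-link pairs: closest first, skipped if they break a cannot-link.
    for dist, a, b in sorted(scored_pairs):
        if dist <= threshold and not cc.violates_cannot_link(a, b):
            cc.union(a, b)
    groups = {}
    for p in store_of:
        groups.setdefault(cc.find(p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical products: id -> store.
store_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
must_link = [("a1", "b1")]                     # e.g. same UPC
scored_pairs = [(0.05, "a2", "b1"), (0.10, "b1", "c1"), (0.90, "a2", "c1")]

groups = cluster(store_of, must_link, scored_pairs, threshold=0.3)
print(groups)
```

Note that `a2` stays out of the group even though it is the closest pair with `b1`: the group already contains `a1` from the same store, so the cannot-link constraint wins over the small distance.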
![Page 19: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/19.jpg)
Overall Process
HTML → Parsing → Product Record → Classification → Classified Products → Attribute Extraction → Attributes → Blocking → Product Groups → Match Inference → Matches
![Page 20: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/20.jpg)
Evaluation
Diagram labels: Reported, Actual, Correct
Precision = Correct / Reported
Recall = Correct / Actual
Precision: Sample and spot-check
Recall: Hard to estimate; rare population; manually search products on a site to produce blind sets
Lack of ground truth is the biggest roadblock
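The two ratios can be computed directly once matches are represented as sets of URL pairs. In the sketch below, `reported` stands for pairs the system produced and `actual` for pairs in a manually built blind set; both sets are made-up examples, and on real data `actual` would only be a sample of the true matches.

```python
def precision_recall(reported, actual):
    """Precision = correct / reported, recall = correct / actual."""
    reported, actual = set(reported), set(actual)
    correct = reported & actual
    precision = len(correct) / len(reported) if reported else 0.0
    recall = len(correct) / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical match pairs, each stored with its URLs in sorted order.
reported = {("u1", "u2"), ("u1", "u3"), ("u4", "u5")}
actual = {("u1", "u2"), ("u4", "u5"), ("u6", "u7"), ("u8", "u9")}

precision, recall = precision_recall(reported, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```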
![Page 21: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/21.jpg)
Tools Matter
![Page 22: Indix5thElephantDraft2](https://reader036.vdocuments.us/reader036/viewer/2022062523/58eba7291a28abdc638b469f/html5/thumbnails/22.jpg)
Indix