Product Matching at Scale Nikhil Ketkar Data Science @ Indix 16 July 2015

TRANSCRIPT

Page 1: Indix5thElephantDraft2

Product Matching at Scale
Nikhil Ketkar
Data Science @ Indix
16 July 2015

Page 2: Indix5thElephantDraft2

Indix

Page 3: Indix5thElephantDraft2

Indix

Page 4: Indix5thElephantDraft2

Indix

Page 5: Indix5thElephantDraft2

Problem Definition

Crawler -> Product Pages -> Matching -> Groups of Matching URLs

Focus of the Talk: Matching

Page 6: Indix5thElephantDraft2

Business Impact

Competitive Landscape: Who are your competitors? How are they pricing products? What other products do they carry?

Scale: Products, Sites, Categories

Store

Product

Match

Matching is central to answering key questions in retail analytics

Page 7: Indix5thElephantDraft2

Sub-Problem: Parsing

1. Title
2. Image URL
3. Price
4. Description
5. Tables

Challenges: Scale, Depth, Diversity, Change

DOM Tree

Title or Not: Binary Classification

Class Imbalance

Page 8: Indix5thElephantDraft2

Parsing: Feature Engineering is Key

DOM Tree

HTML Features

Visual Features

Random Forest Model
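
The HTML-feature step named on this slide can be illustrated with a minimal sketch that walks a page and emits one feature row per text node. The feature names (tag, depth, text_len, in_heading) are assumptions chosen for illustration, not the features used at Indix; visual features and the random forest itself are out of scope here.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "meta", "link", "input", "hr"}

class NodeFeatureExtractor(HTMLParser):
    """Collect one feature row per non-empty text node (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open tags from the root down to the current node
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        self.rows.append({
            "text": text,
            "tag": self.stack[-1],
            "depth": len(self.stack),
            "text_len": len(text),
            "in_heading": self.stack[-1] in {"h1", "h2"},
        })

def extract_features(html):
    parser = NodeFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows

page = "<html><body><h1>Nike Air Zoom</h1><p>Price: $99</p></body></html>"
rows = extract_features(page)
```

Each row would become one training example for the "title or not" classifier. Since a page has many nodes and only one title, the positive class is rare, which is the class imbalance the previous slide mentions.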

Page 9: Indix5thElephantDraft2

Sub-Problem: Classification

1. Title
2. Image URL
3. Price
4. Description
5. Tables

Category

Taxonomy

Challenges: Large Taxonomy, Lack of Training Data, Changes in Taxonomy

Page 10: Indix5thElephantDraft2

Classification: Using Ensembles is Key

Linear SVM

CNN

Ensemble

Breadcrumb Mapping

Background Knowledge
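
The slide does not say how the ensemble combines its members. One common option is weighted voting, sketched below under that assumption. The model names, weights, confidences, and category labels are all hypothetical.

```python
from collections import defaultdict

def ensemble_vote(predictions, weights):
    """Combine (category, confidence) predictions by weighted voting.

    predictions: dict of model name -> (category, confidence), or None
                 when a model abstains
    weights:     dict of model name -> weight (hypothetical values)
    """
    scores = defaultdict(float)
    for model, pred in predictions.items():
        if pred is None:  # e.g. the page has no breadcrumb to map
            continue
        category, confidence = pred
        scores[category] += weights.get(model, 1.0) * confidence
    if not scores:
        return None
    # Sort the keys first so that ties break deterministically.
    return max(sorted(scores), key=lambda c: scores[c])

weights = {"linear_svm": 1.0, "cnn": 1.0, "breadcrumb": 2.0}
predictions = {
    "linear_svm": ("Shoes > Running", 0.6),
    "cnn": ("Shoes > Training", 0.7),
    "breadcrumb": ("Shoes > Training", 0.9),
}
category = ensemble_vote(predictions, weights)
```

Letting a member abstain matters because breadcrumbs are not present on every page, while the text classifiers can always produce a prediction.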

Page 11: Indix5thElephantDraft2

Sub-Problem: Attribute Extraction

1. Title
2. Image URL
3. Price
4. Description
5. Tables

Challenges: Large number of attributes, bad/missing data, variability

1. Brand
2. Size
3. Color
4. Packs
5. …

Schema

Page 12: Indix5thElephantDraft2

Training Shoes with Different Brands

Brand: Nike

Brand: Reebok

Page 13: Indix5thElephantDraft2

Attribute Extraction using CRFs

Brand: Nike
Color: Black/Neo Lime Total-Crimson
Sole: Rubber
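
A CRF assigns one label to each token of the product text. The sketch below shows only the final step, turning per-token labels into attribute values, and assumes the labels follow the common BIO scheme (B- begins a value, I- continues it, O is outside any attribute). The token sequence and labels are made up for illustration; training and running the CRF are not shown.

```python
def labels_to_attributes(tokens, labels):
    """Group BIO-tagged tokens into attribute values.

    Labels look like "B-Brand", "I-Color", or "O" (outside any attribute).
    Returns a dict of attribute name -> list of extracted values.
    """
    attributes = {}
    current_name, current_tokens = None, []

    def flush():
        # Store the value collected so far, if any.
        if current_name is not None:
            attributes.setdefault(current_name, []).append(" ".join(current_tokens))

    for token, label in zip(tokens, labels):
        prefix, _, name = label.partition("-")
        if prefix == "I" and name == current_name:
            current_tokens.append(token)  # continue the open value
            continue
        flush()
        if prefix in ("B", "I"):
            current_name, current_tokens = name, [token]
        else:
            current_name, current_tokens = None, []
    flush()
    return attributes

# Hypothetical tokens with the labels a trained CRF might emit.
tokens = ["Nike", "Shoe", "Black/Neo", "Lime", "Rubber", "Sole"]
labels = ["B-Brand", "O", "B-Color", "I-Color", "B-Sole", "O"]
attributes = labels_to_attributes(tokens, labels)
```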

Page 14: Indix5thElephantDraft2

Sub-Problem: Blocking

1. Title
2. Image URL
3. Price
4. Description
5. Tables

Challenges: No single approach works well

1. Brand
2. Size
3. Color
4. Packs
5. …

Category

Enriched Product Record

Page 15: Indix5thElephantDraft2

Blocking is Critical and Needs Multiple Approaches

Merge Groups

Bucketing/Clustering

Mass Join

LSH
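
Since no single blocking approach works well, candidate groups from several approaches have to be combined. The sketch below assumes two simple bucketing keys (brand and MPN) and reads "Merge Groups" as a transitive union of overlapping groups, implemented with union-find. Both the keys and that reading are assumptions; Mass Join and LSH are not shown.

```python
from collections import defaultdict

def bucket_by(records, key_fn):
    """Group record ids that share a blocking key; drop singleton buckets."""
    buckets = defaultdict(list)
    for rec_id, rec in records.items():
        key = key_fn(rec)
        if key is not None:  # a missing value produces no candidates
            buckets[key].append(rec_id)
    return [ids for ids in buckets.values() if len(ids) > 1]

def merge_groups(groups):
    """Merge overlapping candidate groups with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for group in groups:
        root = find(group[0])
        for other in group[1:]:
            parent[find(other)] = root

    merged = defaultdict(set)
    for x in parent:
        merged[find(x)].add(x)
    return sorted(sorted(m) for m in merged.values())

# Hypothetical enriched product records.
records = {
    "a": {"brand": "nike", "mpn": "X1"},
    "b": {"brand": "nike", "mpn": None},
    "c": {"brand": None, "mpn": "X1"},
    "d": {"brand": "reebok", "mpn": "Y2"},
}
groups = bucket_by(records, lambda r: r["brand"]) + bucket_by(records, lambda r: r["mpn"])
blocks = merge_groups(groups)
```

The resulting blocks are only candidates for comparison, not matches. Blocking exists to keep the pairwise distance computation in the next stage tractable.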

Page 16: Indix5thElephantDraft2

Sub-Problem: Match Inference

Challenges: Pairwise Distance Computation, Match at a Store Constraint

Store

Product

Match

1. Pairwise Distance Computation
2. Constrained Clustering

Page 17: Indix5thElephantDraft2

Distance Computation

1. Title BOW
2. Brand
3. Category
4. Attributes
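
One way to combine the four signals on this slide (title bag of words, brand, category, attributes) is a weighted sum of per-signal distances. The sketch below assumes Jaccard distance on title tokens, exact-match comparisons for the other fields, and arbitrary weights that sum to 1. None of these choices come from the talk.

```python
def jaccard_distance(a, b):
    """One minus the ratio of shared tokens to all tokens in either collection."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical weights; in practice these would be tuned or learned.
DEFAULT_WEIGHTS = {"title": 0.4, "brand": 0.2, "category": 0.2, "attributes": 0.2}

def product_distance(p, q, weights=DEFAULT_WEIGHTS):
    """Weighted distance in [0, 1] between two enriched product records."""
    title = jaccard_distance(p["title"].lower().split(), q["title"].lower().split())
    brand = 0.0 if p["brand"] == q["brand"] else 1.0
    category = 0.0 if p["category"] == q["category"] else 1.0
    shared = set(p["attributes"]) & set(q["attributes"])
    if shared:
        # Fraction of shared attribute names whose values disagree.
        attrs = sum(p["attributes"][k] != q["attributes"][k] for k in shared) / len(shared)
    else:
        attrs = 0.5  # nothing comparable: treat as neutral (assumption)
    return (weights["title"] * title + weights["brand"] * brand
            + weights["category"] * category + weights["attributes"] * attrs)

p = {"title": "Nike Air Zoom Training Shoe Black", "brand": "Nike",
     "category": "Shoes", "attributes": {"color": "black", "size": "10"}}
q = {"title": "Nike Air Zoom Trainer Black", "brand": "Nike",
     "category": "Shoes", "attributes": {"color": "black"}}
d = product_distance(p, q)
```

Only attributes present in both records are compared, which keeps missing data from counting as a mismatch.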

Page 18: Indix5thElephantDraft2

Some Constraints cannot be Violated

Constraint Type: Examples
Must Link: UPC, MPN
Cannot Link: Match at a Store

[Figure: Must Link, Cannot Link, May Link]

Use Constrained Clustering
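
A minimal sketch of constrained clustering under the constraints on this slide: must-link pairs (for example a shared UPC or MPN) are merged first, then may-link pairs are merged in order of increasing distance unless the merge would put two products from the same store in one cluster. The greedy strategy, the threshold, and the data are assumptions for illustration, not the method from the talk.

```python
def constrained_clusters(stores, must_link, scored_pairs, threshold):
    """Greedy constrained clustering sketch.

    stores:       dict of product id -> store id
    must_link:    iterable of (id, id) pairs that always merge
    scored_pairs: iterable of (distance, id, id) may-link candidates
    threshold:    largest distance at which a may-link pair is merged
    """
    cluster_of = {pid: pid for pid in stores}
    members = {pid: {pid} for pid in stores}

    def merge(a, b):
        ra, rb = cluster_of[a], cluster_of[b]
        if ra == rb:
            return
        for pid in members[rb]:
            cluster_of[pid] = ra
        members[ra] |= members.pop(rb)

    def shares_store(a, b):
        # Cannot-link: a product matches at most one product per store.
        sa = {stores[p] for p in members[cluster_of[a]]}
        sb = {stores[p] for p in members[cluster_of[b]]}
        return bool(sa & sb)

    for a, b in must_link:
        merge(a, b)
    for dist, a, b in sorted(scored_pairs):
        if dist > threshold:
            break
        if cluster_of[a] != cluster_of[b] and not shares_store(a, b):
            merge(a, b)
    return sorted(sorted(m) for m in members.values())

# Hypothetical products: a1 and a2 are sold by the same store "A".
stores = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
must_link = [("a1", "b1")]
scored_pairs = [(0.1, "a2", "b1"), (0.2, "c1", "a1"), (0.9, "c1", "a2")]
clusters = constrained_clusters(stores, must_link, scored_pairs, threshold=0.5)
```

Here a2 is close to b1 but stays in its own cluster, because b1 is already linked to a1 and both a1 and a2 come from store "A".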

Page 19: Indix5thElephantDraft2

Overall Process

Stages: Parsing -> Classification -> Attribute Extraction -> Blocking -> Match Inference

Data: HTML -> Product Record -> Classified Products -> Attributes -> Product Groups -> Matches

Page 20: Indix5thElephantDraft2

Evaluation

[Figure: Reported, Actual, and Correct matches]

Precision: Sample and spot-check.

Recall: Hard to estimate. Rare population. Manually search products on a site to produce blind sets.

Lack of ground truth is the biggest roadblock.
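
With the standard definitions, precision is the number of correct matches divided by the number of reported matches, and recall is the number of correct matches divided by the number of actual matches. The sketch below mirrors the two estimation procedures named on this slide: precision from spot-check judgements on a sample of reported matches, and recall from a manually built blind set. The function names and data are hypothetical.

```python
def precision_from_sample(judgements):
    """judgements: booleans from spot-checking a sample of reported matches."""
    return sum(judgements) / len(judgements)

def recall_from_blind_set(blind_pairs, reported_pairs):
    """Fraction of manually found true matches that the system also reported."""
    found = sum(1 for pair in blind_pairs if pair in reported_pairs)
    return found / len(blind_pairs)

def pair(u, v):
    # A match is unordered, so store each pair as a frozenset.
    return frozenset((u, v))

reported = {pair("u1", "u2"), pair("u3", "u4")}
blind_set = [pair("u2", "u1"), pair("u5", "u6")]

precision = precision_from_sample([True, True, False, True])
recall = recall_from_blind_set(blind_set, reported)
```

Both numbers are estimates whose quality depends on how the sample and the blind set were drawn, which is why the lack of ground truth is such a roadblock.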

Page 21: Indix5thElephantDraft2

Tools Matter

Page 22: Indix5thElephantDraft2

Indix