Matching and Clustering Product Descriptions using Learned Similarity Metrics. William W. Cohen, Google & CMU. 2009 IJCAI Workshop on Information Integration and the Web. Joint work with Frank Lin, John Wong, Natalie Glance, Charles Schafer, Roy Tromble.


TRANSCRIPT

Page 1:

Matching and Clustering Product Descriptions using Learned Similarity Metrics

William W. Cohen, Google & CMU

2009 IJCAI Workshop on Information Integration and the Web

Joint work with Frank Lin, John Wong, Natalie Glance, Charles Schafer, Roy Tromble

Page 2:

Scaling up Information Integration

• Small scale integration
  – Few relations, attributes, information sources, …
  – Integrate using knowledge-based approaches

• Medium scale integration
  – More relations, attributes, information sources, …
  – Statistical approaches work for entity matching (e.g., TFIDF) …

• Large scale integration
  – Many relations, attributes, information sources, …
  – Statistical approaches appropriate for more tasks
  – Scalability issues are crucial

Page 3:

Scaling up Information Integration

• Outline:
  – Product search as a large-scale II task
  – Issue: determining identity of products with context-sensitive similarity metrics
  – Scalable clustering techniques
  – Conclusions

Page 4:

Google Product Search: A Large-Scale II Task

Page 5:

The Data

• Feeds from merchants
  – attribute/value data, where attribute & value can be any strings

• The web
  – merchant sites, their content and organization
  – review sites, their content and organization
  – images, videos, blogs, links, …

• User behavior
  – searches & clicks

Page 6:

Challenges: Identifying Bad Data

• Spam detection
• Duplicate merchants
• Porn detection
• Bad merchant names
• Policy violations

Page 7:

Challenges: Structured data from the web

• Offers from merchants
• Merchant reviews
• Product reviews
• Manufacturer specs
• ...

Page 8:

Challenges: Understanding Products

• Catalog construction
  – canonical description, feature values, price ranges, …

• Taxonomy construction
  – a Nerf gun is a kind of toy, not a kind of gun

• Opinions and mentions of products on the web

• Relationships between products
  – accessories, compatible replacements, …

• Identity

Page 9:

Google Product Search: A Large-Scale II Task

Page 10:

Challenges: Understanding Offers

• Identity
• Category
• Brand name
• Model number
• Price
• Condition
• ...

Plausible baseline for determining if two products are identical:
1) pick a feature set
2) measure similarity with cosine/IDF, ...
3) threshold appropriately

Page 11:

Challenges: Understanding Offers

• Identity
• Category
• Brand name
• Model number
• Price
• Condition
• ...

Plausible baseline for determining if two products are identical:
1) pick a feature set
2) measure similarity with cosine/IDF, ...
3) threshold appropriately

Advantages of cosine/IDF:
• Robust: works well for many types of entities
• Very fast to compute sim(x,y)
• Very fast to find all y with sim(x,y) > θ using inverted indices (see the sketch below)
• Extensive prior work on similarity joins
• Setting IDF weights:
  – requires no labeled data
  – requires only one pass over the unlabeled data
  – is easily parallelized
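A minimal sketch of this baseline in Python (the word-token feature set and the threshold value are illustrative assumptions, not the production pipeline):

```python
import math
from collections import defaultdict

def tokenize(text):
    # Illustrative feature set: lower-cased word tokens.
    return set(text.lower().split())

def build_model(docs):
    """One pass over the unlabeled offers: IDF weights, unit-length
    IDF-weighted vectors, and an inverted index from feature -> doc ids."""
    featsets = [tokenize(d) for d in docs]
    df = defaultdict(int)
    for fs in featsets:
        for f in fs:
            df[f] += 1
    n = len(docs)
    idf = {f: math.log(n / c) for f, c in df.items()}
    vectors, index = [], defaultdict(list)
    for i, fs in enumerate(featsets):
        v = {f: idf[f] for f in fs}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vectors.append({f: w / norm for f, w in v.items()})
        for f in fs:
            index[f].append(i)
    return vectors, index

def neighbors(i, vectors, index, theta=0.6):
    """All j with cosine/IDF sim(i, j) > theta; the inverted index means we
    only ever touch documents that share at least one feature with i."""
    scores = defaultdict(float)
    for f, w in vectors[i].items():
        for j in index[f]:
            if j != i:
                scores[j] += w * vectors[j][f]
    return sorted(((s, j) for j, s in scores.items() if s > theta), reverse=True)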

Page 12:

Product similarity: challenges

• Similarity can be high for descriptions of distinct items:

o AERO TGX-Series Work Table -42'' x 96'' Model 1TGX-4296 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop edge ...

o AERO TGX-Series Work Table -42'' x 48'' Model 1TGX-4248 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop .. 

• Similarity can be low for descriptions of identical items:

o Canon Angle Finder C 2882A002 Film Camera Angle Finders Right Angle Finder C (Includes ED-C & ED-D Adapters for All SLR Cameras) Film Camera Angle Finders & Magnifiers The Angle Finder C lets you adjust  ...

o  CANON 2882A002 ANGLE FINDER C FOR EOS REBEL® SERIES PROVIDES A FULL SCREEN IMAGE SHOWS EXPOSURE DATA BUILT-IN DIOPTRIC ADJUSTMENT COMPATIBLE WITH THE CANON® REBEL, EOS & REBEL EOS SERIES.

Page 13:

Product similarity: challenges

• Linguistic diversity and domain-dependent technical specs:
  o "Camera angle finder" vs "right angle finder", "Dioptric adjustment";
    "Aerospec Designed", "V countertop edge", ...
• Labeled training data is not easy to produce for subdomains
• Imperfect and/or poorly adopted standards for identifiers
• Different levels of granularity in descriptions
  o Brands, manufacturer, …
  o Product vs. product series
  o Reviews of products vs. offers to sell products

• Each merchant is different: intra-merchant regularities can dominate the intra-product regularities

Page 14:

Clustering objects from many sources

• Possible approaches:
  – 1) Model the inter- and intra-source variability directly
    (e.g., Bagnell, Blei, McCallum, UAI 2002; Bhattacharya & Getoor, SDM 2006); a latent variable captures source-specific effects

– Problem: model is larger and harder to evaluate

Page 15:

Clustering objects from many sources

• Possible approaches:
  – 1) Model the inter- and intra-source variability directly

– 2) Exploit background knowledge and use constrained clustering:

• Each merchant's catalog is duplicate-free

• If x and y are from the same merchant, constrain the clustering so that CANNOT-LINK(x,y)

– More realistically: locally dedup each catalog and use a soft constraint on clustering

• E.g., Oyama & Tanaka, 2008: a distance metric learned from cannot-link constraints only, using quadratic programming

• Problem: expensive for very large datasets

Page 16:

Scaling up Information Integration

• Outline:
  – Product search as a large-scale II task
  – Issue: determining identity of products
    • Merging many catalogs to construct a larger catalog
    • Issues arising from having many source catalogs
    • Possible approaches based on prior work
    • A simple scalable approach to exploiting many sources: learning a distance metric
    • Experiments with the new distance metric
  – Scalable clustering techniques
  – Conclusions

Page 17:

Clustering objects from many sources

Here: adjust the IDF importance weights for f using an easily-computed statistic CX(f).

• c_i is the source ("context") of item x_i (the selling merchant)
• D_f is the set of items with feature f
• x_i ~ D_f is a uniform draw
• n_{c,f} is the number of items from c with feature f

plus smoothing

Page 18:

Clustering objects from many sources

Here: adjust the IDF importance weights for f using an easily-computed statistic CX(f).

• c_i is the source of item x_i
• D_f is the set of items with feature f
• x_i ~ D_f is a uniform draw
• n_{c,f} is the number of items from c with feature f

plus smoothing

Page 19:

Clustering objects from many sources

Here: adjust the IDF importance weights for f using an easily-computed statistic CX(f).

CX.IDF(f) = CX(f) · IDF(f)
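A minimal sketch of one natural reading of the definitions above, assuming CX(f) estimates the probability that two items drawn from D_f come from different sources; the pair-counting estimate and the Beta smoothing below are assumptions, not necessarily the exact statistic from the talk:

```python
import math
from collections import defaultdict

def cx_weights(items, alpha=0.5, beta=0.5):
    """items: list of (source c_i, feature set of x_i) pairs.
    Assumed reading of the slide: CX(f) ~ Pr(c_i != c_j) for two items drawn
    without replacement from D_f, estimated from the counts n_{c,f} and
    smoothed toward a Beta(alpha, beta) prior."""
    n_f = defaultdict(int)                          # |D_f|
    n_cf = defaultdict(lambda: defaultdict(int))    # n_{c,f}
    for c, feats in items:
        for f in feats:
            n_f[f] += 1
            n_cf[f][c] += 1
    cx = {}
    for f, total in n_f.items():
        pairs = total * (total - 1) / 2.0
        same_pairs = sum(k * (k - 1) / 2.0 for k in n_cf[f].values())
        diff_pairs = pairs - same_pairs
        # Beta-smoothed fraction of cross-source ("inter-context") pairs
        cx[f] = (diff_pairs + alpha) / (pairs + alpha + beta)
    return cx

def cx_idf_weights(items):
    """CX.IDF(f) = CX(f) * IDF(f), as on the slide above."""
    n = len(items)
    df = defaultdict(int)
    for _, feats in items:
        for f in feats:
            df[f] += 1
    idf = {f: math.log(n / c) for f, c in df.items()}
    cx = cx_weights(items)
    return {f: cx[f] * idf[f] for f in cx}
```

Features that occur mostly within a single merchant get a small CX(f) and are down-weighted, while features shared across merchants keep close to their full IDF weight.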

Page 20:

Motivations

• Theoretical: CX(f) is related to the naïve Bayes weights for a classifier of pairs of items (x,y):
  – Classification task: is the pair intra- or inter-source?
  – Eliminating intra-source pairs enforces CANNOT-LINK constraints; using a naïve Bayes classifier approximates this
  – Features of the pair (x,y) are all common features of item x and item y
  – Training data: all intra- and inter-source pairs
    • don't need to enumerate them explicitly

• Experimental: coming up!

Page 21:

Smoothing the CX(f) weights

1. When estimating Pr( _ | x_i, x_j ), use a Beta distribution with (α,β) = (½,½).

2. When estimating Pr( _ | x_i, x_j ) for f, use a Beta distribution with (α,β) computed from (μ,σ) (see the sketch below)
  – derived empirically using variant (1) on features "like f", i.e., from the same dataset, same type, …

3. When computing cosine distance, add a "correction" γ
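A hedged sketch of both smoothing variants; mapping (μ,σ) to (α,β) by the method of moments is one natural way to realize variant 2, not necessarily the talk's exact recipe:

```python
def beta_smooth(successes, trials, alpha=0.5, beta=0.5):
    """Variant 1: posterior mean of a Bernoulli rate under a Beta(alpha, beta)
    prior; (1/2, 1/2) is the Jeffreys prior used on the slide."""
    return (successes + alpha) / (trials + alpha + beta)

def beta_from_moments(mu, sigma):
    """Variant 2 (assumed): fit (alpha, beta) to a target mean mu and standard
    deviation sigma by the method of moments, where (mu, sigma) come from
    variant-1 estimates on features 'like f' (same dataset, same type, ...)."""
    var = sigma ** 2
    common = mu * (1.0 - mu) / var - 1.0   # must be > 0 for a valid Beta fit
    return mu * common, (1.0 - mu) * common

# Usage with made-up numbers: an informed prior for one feature family,
# then smoothing a sparsely observed feature from that family.
a, b = beta_from_moments(mu=0.8, sigma=0.1)
p = beta_smooth(successes=3, trials=4, alpha=a, beta=b)
```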

Page 22:

Efficiency of setting CX.IDF

• Traditional IDF:
  – One pass over the dataset to derive weights

• Estimation with (α,β) = (½,½):
  – One pass over the dataset to derive weights (see the sketch below)
  – Map-reduce can be used
  – Correcting with fixed γ adds no learning overhead

• Smoothing with "informed priors":
  – Two passes over the dataset to derive weights
  – Map-reduce can be used
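The one-pass, map-reduce-friendly claim can be made concrete with a toy map/reduce-style sketch; the function names and the in-memory reducer are illustrative, a real job would run on a map-reduce framework:

```python
from collections import defaultdict

def map_item(source, features):
    # One record per (feature, source) occurrence, per feature, and per item.
    for f in set(features):
        yield ("n_cf", f, source), 1
        yield ("n_f", f), 1
    yield ("n_docs",), 1

def reduce_counts(records):
    # Sum the emitted 1s per key; a real reducer sees each key pre-grouped.
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return totals

# A single "pass": map over every item, then reduce. IDF and the
# (alpha, beta) = (1/2, 1/2) variant of CX(f) can both be computed from
# these totals alone, so no second pass over the items is needed.
items = [("merchantA", ["canon", "angle", "finder"]),
         ("merchantB", ["canon", "2882a002", "angle", "finder"])]
totals = reduce_counts(rec for src, fs in items for rec in map_item(src, fs))
```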

Page 23:

Scaling up Information Integration

• Outline:
  – Product search as a large-scale II task
  – Issue: determining identity of products
    • Merging many catalogs to construct a larger catalog
    • Issues arising from having many source catalogs
    • Possible approaches based on prior work
    • A simple scalable approach to exploiting many sources: learning a distance metric
    • Experiments with the new distance metric
  – Scalable clustering techniques
  – Conclusions

Page 24:

Warmup: Experiments with k-NN classification

• Classification vs. matching:
  – a better-understood problem with fewer "moving parts"

• Nine small classification datasets
  – from Cohen & Hirsh, KDD 1998
  – instances are short, name-like strings

• Use the class label as context (metric learning)
  – equivalent to MUST-LINK constraints
  – stretch same-context features in the "other" direction: heavier weight for features that co-occur in same-context pairs
  – CX⁻¹.IDF weighting (aka IDF/CX)

Page 25:

Experiments with k-NN classification

Procedure (see the sketch below):
• learn the similarity metric from (labeled) training data
• for each test instance, find the closest k=30 items in the training data and predict the distance-weighted majority class
• predict the majority class of the training data if no neighbor has similarity > 0
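A minimal sketch of this procedure; `sim` stands in for whichever learned metric is being evaluated, and the helper names are hypothetical:

```python
from collections import Counter, defaultdict

def knn_predict(x, train, sim, k=30):
    """train: list of (item, label); sim(x, y) -> similarity score >= 0.
    Distance-weighted vote among the k most similar training items, falling
    back to the training majority class when no neighbor has similarity > 0."""
    majority = Counter(lbl for _, lbl in train).most_common(1)[0][0]
    scored = sorted(((sim(x, y), lbl) for y, lbl in train), reverse=True)[:k]
    votes = defaultdict(float)
    for s, lbl in scored:
        if s > 0:
            votes[lbl] += s          # weight each neighbor by its similarity
    return max(votes, key=votes.get) if votes else majority
```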

Page 26:

Experiments with k-NN classification

[Figure: ratio of k-NN error to baseline k-NN error per dataset; lower is better; * marks results statistically significantly better than the baseline. Panels: (α,β) = (½,½) and (α,β) from (μ,σ).]

Page 27:

Experiments Matching Bibliography Data

• Scraped LaTeX *.bib files from the web:
  – 400+ files with 100,000+ bib entries
  – all contain the phrase "machine learning"
  – generated 3,000,000 "weakly similar" pairs of bib entries
  – scored and ranked the pairs with IDF, CX.IDF, …

• Used paper URLs and/or DOIs to assess precision (see the sketch below)
  – about 3% have useful identifiers
  – pairings between these 3% can be assessed as right/wrong

• Smoothing done using informed priors
  – unsmoothed weights averaged over all tokens in a specific bibliography entry field (e.g., author)

• Data is publicly available
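A small sketch of how the identifier-based assessment and the precision-versus-rank curves can be computed; the `same_id` helper and the convention of skipping unjudgeable pairs are assumptions:

```python
def interpolated_precision_at_rank(scored_pairs, same_id):
    """scored_pairs: list of (score, x, y), sorted descending by score.
    same_id(x, y) -> True / False / None; None means neither record carries a
    usable URL/DOI, so the pair is skipped rather than judged."""
    hits = judged = 0
    precision = []
    for _, x, y in scored_pairs:
        verdict = same_id(x, y)
        if verdict is None:
            continue
        judged += 1
        hits += int(verdict)
        precision.append(hits / judged)
    # Interpolation: precision at rank r = max precision at any rank >= r.
    for r in range(len(precision) - 2, -1, -1):
        precision[r] = max(precision[r], precision[r + 1])
    return precision
```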

Page 28:

Matching performance for bibliography entries

[Figure: interpolated precision versus rank (γ=10, R<10k), comparing baseline IDF, (α,β) = (½,½), and (α,β) from (μ,σ).]

Page 29:

[Figure: known errors versus rank (γ=10, R<10k), comparing baseline IDF, (α,β) = (½,½), and (α,β) from (μ,σ).]

Page 30:

Matching performance for bibliography entries - at higher recall

[Figure: errors versus rank (γ=10, R>10k).]

Page 31:

Experiments Matching Product Data

• Data from >700 web sites, merchants, and hand-built catalogs
• Larger number of instances: >40M
• Scored and ranked >50M weakly similar pairs
• Hand-tuned feature set
  – but tuned on an earlier version of the data
• Used hard identifiers (ISBN, GTIN, UPC) to assess accuracy
  – more than half have useful hard identifiers
  – most hard identifiers appear only once or twice

Page 32:

Experiments Matching Product Data

[Figure: matching performance on product data, comparing baseline IDF, (α,β) = (½,½), and (α,β) from (μ,σ).]

Page 33:

Experiments with product data

[Figure: matching performance on product data, comparing baseline IDF, (α,β) = (½,½), and (α,β) from (μ,σ).]

Page 34:

Scaling up Information Integration

• Outline:
  – Product search as a large-scale II task
  – Issue: determining identity of products with context-sensitive similarity metrics
  – Scalable clustering techniques (w/ Frank Lin)
    • Background on spectral clustering techniques
    • A fast approximate spectral technique
    • Theoretical justification
    • Experimental results
  – Conclusions

Page 35:

Spectral Clustering: Graph = Matrix

[Figure: a 10-node graph (nodes A through J) drawn next to its adjacency matrix; each row lists 1s for that node's neighbors.]

Page 36:

Spectral Clustering: Graph = Matrix; Transitively Closed Components = "Blocks"

[Figure: the graph's adjacency matrix with nodes ordered by cluster, so the transitively closed components appear as diagonal blocks.]

Of course we can't see the "blocks" unless the nodes are sorted by cluster…

Page 37:

Spectral Clustering: Graph = Matrix; Vector = Node Weight

[Figure: the adjacency matrix M next to a vector v of node weights (e.g., weight 3 on A, 2 on B, 3 on C, …); a vector assigns a number to every node of the graph.]

Page 38:

Spectral Clustering: Graph = Matrix; M*v1 = v2 "propagates weights from neighbors"

[Figure: multiplying the adjacency matrix M by a weight vector v1 gives v2; each entry of v2 is the sum of the weights of that node's neighbors in v1 (e.g., the new value at A is 2*1 + 3*1 + 0*1, at B is 3*1 + 3*1, at C is 3*1 + 2*1).]

Page 39:

Spectral Clustering: Graph = Matrix; W*v1 = v2 "propagates weights from neighbors"

[Figure: the same product computed with W, a copy of M normalized so that its columns sum to 1; e.g., the new value at A is now 2*.5 + 3*.5 + 0*.3. A small numpy example follows.]
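A tiny numpy example in the spirit of the lost matrix figure; the six-node graph below is illustrative, not the graph from the slides:

```python
import numpy as np

# Two triangles {A,B,C} and {D,E,F}: a toy graph with two obvious clusters.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

W = A / A.sum(axis=0, keepdims=True)   # normalize so columns sum to 1
v1 = np.array([3.0, 2.0, 3.0, 1.0, 1.0, 1.0])   # arbitrary node weights

# Each node collects its neighbors' weights, with every neighbor's weight
# split evenly among that neighbor's own neighbors.
v2 = W @ v1
```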

Page 40:

Spectral Clustering: Graph = Matrix; W*v1 = v2 "propagates weights from neighbors"

W v = λ v : v is an eigenvector of W with eigenvalue λ  [Shi & Meila, 2002]

[Figure: the eigenvalues of W, λ1 > λ2 ≥ λ3 ≥ λ4 ≥ λ5,6,7,…, with eigenvectors e1, e2, e3, …; an "eigengap" separates the leading eigenvalues from the rest.]

Page 41:

Spectral Clustering: Graph = Matrix; W*v1 = v2 "propagates weights from neighbors"

W v = λ v : v is an eigenvector of W with eigenvalue λ  [Shi & Meila, 2002]

[Figure: the nodes plotted using two of the leading eigenvectors as coordinates; the points from the three clusters (marked x, y, and z) form well-separated groups.]

Page 42:

Spectral Clustering: Graph = Matrix; W*v1 = v2 "propagates weights from neighbors"

W v = λ v : v is an eigenvector of W with eigenvalue λ

If W is connected but roughly block diagonal with k blocks, then:
• the top eigenvector is a constant vector
• the next k eigenvectors are roughly piecewise constant, with "pieces" corresponding to blocks

Page 43:

Spectral Clustering: Graph = Matrix; W*v1 = v2 "propagates weights from neighbors"

W v = λ v : v is an eigenvector of W with eigenvalue λ

If W is connected but roughly block diagonal with k blocks, then:
• the "top" eigenvector is a constant vector
• the next k eigenvectors are roughly piecewise constant, with "pieces" corresponding to blocks

Spectral clustering:
• Find the top k+1 eigenvectors v1, …, v_{k+1}
• Discard the "top" one
• Replace every node a with the k-dimensional vector x_a = <v2(a), …, v_{k+1}(a)>
• Cluster with k-means (see the sketch below)
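A compact sketch of this recipe using numpy and scikit-learn; the library choice and the row-normalization convention (the transpose of the slide's column-normalized W) are mine, and the affinity matrix is assumed connected:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(A, k):
    """A: symmetric nonnegative affinity matrix (n x n); k: number of clusters.
    Normalize A into a random-walk matrix, take the top k+1 eigenvectors,
    drop the constant top one, and run k-means on the remaining k coordinates."""
    W = A / A.sum(axis=1, keepdims=True)      # rows sum to 1
    vals, vecs = np.linalg.eig(W)             # W is not symmetric in general
    order = np.argsort(-vals.real)            # eigenvalues in decreasing order
    X = vecs[:, order[1:k + 1]].real          # discard the "top" eigenvector
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)
```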

Page 44:

Spectral Clustering: Pros and Cons

• Elegant, and well-founded mathematically
• Works quite well when relations are approximately transitive (like similarity)
• Very noisy datasets cause problems
  – "informative" eigenvectors need not be in the top few
  – performance can drop suddenly from good to terrible
• Expensive for very large datasets
  – computing eigenvectors is the bottleneck

• There is a very scalable way to compute the top eigenvector

Page 45:

Aside: power iteration to compute the top eigenvector

• Let v_0 be almost any vector
• Repeat until convergence (c is a normalizer):
  – v_t = c · W · v_{t-1} (see the sketch below)
• This is how PageRank is computed
  – for a different W
• This converges to the top eigenvector
  – which in this case is constant
  – but …
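The loop above in a few lines of numpy; the L1 normalization and the convergence test are illustrative choices:

```python
import numpy as np

def power_iteration(W, iters=100, tol=1e-9, rng=None):
    """Repeated v_t = c * W @ v_{t-1} from a (nearly) arbitrary start vector.
    Converges to the dominant eigenvector of W (for PageRank, W would be the
    damped link matrix; here it is the normalized affinity matrix)."""
    rng = rng or np.random.default_rng(0)
    v = rng.random(W.shape[0])
    for _ in range(iters):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()     # c: keep the scale of v_t fixed
        if np.abs(v_new - v).max() < tol:
            break
        v = v_new
    return v
```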

Page 46:

Convergence of PI for a clustering problem:
[Figure: snapshots of the components of v_t at increasing t, ordered from smaller to larger; each box is rescaled to the same vertical range.]

Page 47:

Explanation: ???

e_i is the i-th eigenvector

Page 48:

Explanation

Page 49:

Explanation

[Equations on slide: writing v_t in the eigenbasis of W, one group of terms converges to zero quickly and another converges to zero even more quickly.]

Page 50:

Explanation

The eigenvectors are piecewise constant across the clusters, and for each pair of clusters some pair of constants differs substantially.

Page 51:

Explanation: the signal approximates spectral clustering's distance

spec(a,b) : the distance between a and b in spectral clustering's k-means embedding space

…but all the pic(a,b) distances are in a small radius:

Page 52:

PIC: Power Iteration Clustering

Details (see the sketch below):
• run k-means 10 times and pick the best output (by intra-cluster similarity)
• stopping condition: acceleration < 10^-5 / n
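A hedged sketch of PIC as described above, using numpy and scikit-learn; the degree-based starting vector and the exact form of the acceleration test follow the published PIC algorithm and fill in details the slide does not spell out:

```python
import numpy as np
from sklearn.cluster import KMeans

def pic(A, k, max_iter=1000):
    """Power Iteration Clustering (sketch): power iteration on the
    row-normalized affinity matrix, stopping early when the *acceleration*
    of the iterates drops below 1e-5 / n, then k-means on the 1-D embedding."""
    n = A.shape[0]
    W = A / A.sum(axis=1, keepdims=True)
    v = A.sum(axis=1) / A.sum()                 # degree-based starting vector
    delta_prev = np.zeros(n)
    for _ in range(max_iter):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()
        delta = np.abs(v_new - v)
        v = v_new
        if np.abs(delta - delta_prev).max() < 1e-5 / n:   # acceleration test
            break
        delta_prev = delta
    # k-means on the 1-D embedding; scikit-learn's n_init=10 restarts k-means
    # 10 times and keeps the best run, roughly as the slide describes.
    return KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))
```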

Page 53:

Experimental Results

Page 54:

Experimental results: best-case assignment of class labels to clusters

Page 55:

Experiments: run time and scalability

[Figure: run time, in milliseconds.]

Page 56:

Experiments: run time and scalability

Page 57:

Summary

• Large-scale integration:
  – new statistical approaches, assuming huge numbers of objects, relations, and sources of information
  – simplicity and scalability are crucial

• CX.IDF is an extension of IDF weighting
  – exploits statistics in data merged from many locally-deduped sources, a very common integration scenario
  – weights can be "learned" without labeling
  – weight "learning" requires 2-3 passes over the data
  – errors are reduced significantly relative to IDF:
    • 20% lower error on average for classification
    • up to 65% lower error in matching tasks at high recall levels
    • very high precision possible at lower recall levels

Page 58:

Summary

• Large-scale integration:
  – new statistical approaches, assuming huge numbers of objects, relations, and sources of information
  – simplicity and scalability are crucial

• CX.IDF is an extension of IDF weighting
  – simple, scalable, parallelizable

• PIC is a very scalable clustering method
  – formally, works when spectral techniques work
  – experimentally, often better than traditional spectral methods
  – based on power iteration on a normalized matrix, with early stopping
  – experimentally, linear time
  – easy to implement and efficient
  – very easily parallelized

Page 59:

Questions...?