Overview of Adaptive Blocking for DDL Research Lab


Upload: dan-chudnov

Post on 12-Jan-2017


TRANSCRIPT

Page 1: Overview of Adaptive Blocking for DDL Research Lab

Adaptive Blocking - key points

• Reduce computation time

• Apply across domains

• Maximize recall

• Limit false positives

• Disjunctive / DNF blocking

• Approx. Red-Blue Set Cover

• Increased reduction ratio

• Increased recall

Bilenko, Kamath, Mooney. “Adaptive Blocking: Learning to Scale Up Record Linkage.” Proceedings of the 6th IEEE International Conference on Data Mining. Hong Kong, December 2006

Page 2: Overview of Adaptive Blocking for DDL Research Lab

Blocking Predicates

• Index function: generates keys based on field values (e.g. first three letters of name)

• Equality function: returns True for a record pair if the two records share at least one index key

• Covered pairs: the record pairs for which a given predicate's equality function returns True

• Blocking function: a set of blocking predicates, combined into aggregate index & equality functions
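The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the record layout (a dict with a "name" field), the two index functions, and the helper names are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Set

Record = Dict[str, str]
IndexFn = Callable[[Record], Set[str]]


def first_three_letters_of_name(record: Record) -> Set[str]:
    """Index function (hypothetical): key is the first three letters of the name."""
    name = record.get("name", "").strip().lower()
    return {name[:3]} if name else set()


def name_tokens(record: Record) -> Set[str]:
    """Index function (hypothetical): one key per whitespace-separated name token."""
    return set(record.get("name", "").lower().split())


def covers(index_fn: IndexFn, a: Record, b: Record) -> bool:
    """Equality function: True if the two records share at least one index key."""
    return bool(index_fn(a) & index_fn(b))


def blocking_function_covers(index_fns: Iterable[IndexFn], a: Record, b: Record) -> bool:
    """Aggregate equality: the pair is covered if any predicate in the set covers it."""
    return any(covers(fn, a, b) for fn in index_fns)


r1 = {"name": "Jonathan Smith"}
r2 = {"name": "Jon Smyth"}
r3 = {"name": "Mary Smith"}

prefix_r1_r2 = covers(first_three_letters_of_name, r1, r2)  # both keys are "jon"
prefix_r1_r3 = covers(first_three_letters_of_name, r1, r3)  # "jon" vs "mar"
tokens_r1_r3 = covers(name_tokens, r1, r3)                  # shared token "smith"
either_r2_r3 = blocking_function_covers(
    [first_three_letters_of_name, name_tokens], r2, r3
)
```

Here each predicate is represented only by its index function, since the equality test (a non-empty key intersection) is the same for all of them.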

Page 3: Overview of Adaptive Blocking for DDL Research Lab

Goal: find the optimal blocking function, i.e. the one that covers the fewest false positives while still covering most true positives (all but an allowed error)
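One way to write this goal as a constrained optimization is sketched below. The notation is assumed for illustration and does not come from the slides: P is the set of candidate predicates, B the set of true (matching) pairs, R the set of false (non-matching) pairs, f_S(x_i, x_j) = 1 when at least one predicate in S covers the pair, and ε the number of true pairs allowed to remain uncovered.

```latex
S^{*} = \operatorname*{arg\,min}_{S \subseteq P}
  \bigl|\{(x_i, x_j) \in R : f_S(x_i, x_j) = 1\}\bigr|
\quad \text{subject to} \quad
  \bigl|\{(x_i, x_j) \in B : f_S(x_i, x_j) = 1\}\bigr| \ge |B| - \varepsilon
```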

Page 4: Overview of Adaptive Blocking for DDL Research Lab

Disjunctive and DNF blocking

• Disjunctive: select pairs covered by at least one blocking predicate

• Disjunctive Normal Form: select pairs covered by at least one conjunction of blocking predicates
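The difference between the two forms can be shown with a small Python sketch. The predicate names and the per-pair results dictionary are hypothetical; the sketch assumes each predicate has already been evaluated on one record pair.

```python
from typing import Dict, Iterable


def disjunctive_covers(predicates: Iterable[str], results: Dict[str, bool]) -> bool:
    """Disjunctive: the pair is selected if at least one predicate covers it."""
    return any(results[p] for p in predicates)


def dnf_covers(conjunctions: Iterable[Iterable[str]], results: Dict[str, bool]) -> bool:
    """DNF: the pair is selected if every predicate in at least one conjunction covers it."""
    return any(all(results[p] for p in conj) for conj in conjunctions)


# Outcome of each (hypothetical) predicate for a single record pair.
results = {"same_zip": True, "same_name_prefix": False, "same_birth_year": True}

disj = disjunctive_covers(["same_zip", "same_name_prefix"], results)
dnf_strict = dnf_covers([("same_zip", "same_name_prefix")], results)
dnf_mixed = dnf_covers([("same_zip", "same_birth_year"), ("same_name_prefix",)], results)
```

A disjunctive blocking function is the special case of DNF in which every conjunction has length one.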

Page 5: Overview of Adaptive Blocking for DDL Research Lab

Disjunctive Red-blue set cover

DNF Red-blue set cover

Page 6: Overview of Adaptive Blocking for DDL Research Lab

Disjunctive Blocking

• Remove predicates covering too many false pairs

• Remove false pairs covered by too many predicates

• Predicate cost == # of false pairs it covers

• Weighted set cover: greedily select the predicate giving the best improvement (newly covered true pairs relative to cost); check whether the number of uncovered true pairs is within the threshold; repeat
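The steps above can be sketched as a greedy loop. This is a simplified illustration under stated assumptions, not the paper's exact procedure: the coverage dictionary, the three threshold parameters, and the `gain / (cost + 1)` score (the `+ 1` only avoids division by zero) are all choices made for the example.

```python
from collections import Counter
from typing import Dict, Hashable, List, Set, Tuple

Pair = Hashable
# predicate name -> (true pairs covered, false pairs covered)
Coverage = Dict[str, Tuple[Set[Pair], Set[Pair]]]


def greedy_disjunctive_blocking(
    coverage: Coverage,
    max_uncovered_true: int,
    max_false_per_predicate: int,
    max_predicates_per_false_pair: int,
) -> List[str]:
    # Step 1: remove predicates covering too many false pairs.
    kept = {
        name: (set(true), set(false))
        for name, (true, false) in coverage.items()
        if len(false) <= max_false_per_predicate
    }

    # Step 2: ignore false pairs covered by too many of the remaining predicates.
    false_counts = Counter(p for _, false in kept.values() for p in false)
    ignored = {p for p, n in false_counts.items() if n > max_predicates_per_false_pair}

    # Step 3: predicate cost = number of (non-ignored) false pairs it covers.
    cost = {name: len(false - ignored) for name, (_, false) in kept.items()}

    # Assumption: every true pair is covered by at least one candidate predicate.
    uncovered: Set[Pair] = set().union(*(true for true, _ in coverage.values()))

    # Step 4: greedy weighted set cover until few enough true pairs remain uncovered.
    chosen: List[str] = []
    while len(uncovered) > max_uncovered_true:
        best, best_score = None, 0.0
        for name, (true, _) in kept.items():
            if name in chosen:
                continue
            gain = len(true & uncovered)
            if gain == 0:
                continue
            score = gain / (cost[name] + 1)  # illustrative scoring, see lead-in
            if score > best_score:
                best, best_score = name, score
        if best is None:
            break  # no remaining predicate covers any uncovered true pair
        chosen.append(best)
        uncovered -= kept[best][0]
    return chosen


example = {
    "name_prefix": ({1, 2, 3}, {10}),
    "same_zip": ({3, 4}, {10, 11, 12}),
    "same_city": ({1, 2, 3, 4}, {10, 11, 12, 13, 14, 15}),
}
selected_all = greedy_disjunctive_blocking(example, 0, 5, 3)
selected_loose = greedy_disjunctive_blocking(example, 1, 5, 3)
```

In the example, "same_city" is discarded up front because it covers six false pairs; allowing one true pair to stay uncovered lets the loop stop after the first, cheapest predicate.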

Page 7: Overview of Adaptive Blocking for DDL Research Lab

DNF Blocking

• Remove predicates covering too many pairs

• Construct predicate conjunctions, length <= k-1

• Add to the predicate set the conjunctions that maximize the marginal ratio of true to false pairs covered

• Apply Disjunctive Blocking with resulting predicates
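A rough sketch of the conjunction-building step follows. It swaps in a simplification: instead of growing conjunctions greedily, it enumerates all combinations up to a length bound and keeps the `top_n` with the best true/false ratio. The function name, its parameters, and the `+ 1` smoothing in the ratio are assumptions for the example; `max_length` stands in for the length bound given on the slide.

```python
from itertools import combinations
from typing import Dict, Hashable, Set, Tuple

Pair = Hashable
# predicate name -> (true pairs covered, false pairs covered)
Coverage = Dict[str, Tuple[Set[Pair], Set[Pair]]]


def add_conjunctions(
    coverage: Coverage,
    max_length: int,
    max_pairs_per_predicate: int,
    top_n: int,
) -> Coverage:
    # Step 1: remove predicates covering too many pairs.
    kept = {
        name: (true, false)
        for name, (true, false) in coverage.items()
        if len(true) + len(false) <= max_pairs_per_predicate
    }

    # Step 2: build conjunctions; a conjunction covers a pair only if every
    # member predicate covers it, so its coverage is a set intersection.
    candidates = []
    for length in range(2, max_length + 1):
        for names in combinations(sorted(kept), length):
            true = set.intersection(*(kept[n][0] for n in names))
            false = set.intersection(*(kept[n][1] for n in names))
            if not true:
                continue
            ratio = len(true) / (len(false) + 1)  # +1 avoids division by zero
            candidates.append((ratio, names, true, false))

    # Step 3: keep the conjunctions with the best true/false ratio.
    candidates.sort(key=lambda c: c[0], reverse=True)
    extended = dict(kept)
    for _, names, true, false in candidates[:top_n]:
        extended[" AND ".join(names)] = (true, false)
    return extended


example = {
    "name_prefix": ({1, 2, 3}, {10, 11}),
    "same_zip": ({2, 3, 4}, {11, 12}),
    "same_year": ({1, 3}, {12}),
}
extended = add_conjunctions(example, max_length=2, max_pairs_per_predicate=10, top_n=1)
```

The returned dictionary has the same shape as the input, so single predicates and conjunctions can then be treated uniformly by whatever disjunctive selection routine is applied next.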