Overview of Adaptive Blocking for DDL Research Lab
![Page 1: Overview of Adaptive Blocking for DDL Research Lab](https://reader031.vdocuments.us/reader031/viewer/2022030304/5877265e1a28ab2b2c8b4cdb/html5/thumbnails/1.jpg)
Adaptive Blocking - key points
• Reduce computation time
• Apply across domains
• Maximize recall
• Limit false positives
• Disjunctive / DNF blocking
• Approx. Red-Blue Set Cover
• Increased reduction ratio
• Increased recall
Bilenko, Kamath, Mooney. “Adaptive Blocking: Learning to Scale Up Record Linkage.” Proceedings of the 6th IEEE International Conference on Data Mining. Hong Kong, December 2006
![Page 2: Overview of Adaptive Blocking for DDL Research Lab](https://reader031.vdocuments.us/reader031/viewer/2022030304/5877265e1a28ab2b2c8b4cdb/html5/thumbnails/2.jpg)
Blocking Predicates
• Index function: generates keys based on field values (e.g. first three letters of name)
• Equality function: returns True for a record pair if the two records share at least one index key
• Covered pairs: record pairs for which a given predicate's equality function returns True
• Blocking function: a set of blocking predicates combined through an aggregate index function and equality function
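The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names (`first_three_letters`, `make_predicate`, `equal`, `covered_pairs`) and the toy records are assumptions made for the example.

```python
from itertools import combinations


def first_three_letters(value):
    """Index function (illustrative): return the set of keys for one field value."""
    value = value.strip().lower()
    return {value[:3]} if value else set()


def make_predicate(field, index_fn):
    """Blocking predicate: an index function bound to one record field."""
    def keys(record):
        return index_fn(record.get(field, ""))
    return keys


def equal(predicate, a, b):
    """Equality function: True if the two records share at least one index key."""
    return bool(predicate(a) & predicate(b))


def covered_pairs(predicate, records):
    """Pairs of record positions for which the equality function returns True."""
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if equal(predicate, a, b)
    }


# Toy data, invented for this example.
records = [
    {"name": "Jonathan Smith"},
    {"name": "Jon Smith"},
    {"name": "Mary Jones"},
]
name_prefix = make_predicate("name", first_three_letters)
pairs = covered_pairs(name_prefix, records)
print(pairs)
```

Here only the first two records share the key "jon", so the predicate covers that single pair.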
![Page 3: Overview of Adaptive Blocking for DDL Research Lab](https://reader031.vdocuments.us/reader031/viewer/2022030304/5877265e1a28ab2b2c8b4cdb/html5/thumbnails/3.jpg)
Find the optimal blocking function that
minimizes false positives while
covering most true positives (within some error)
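One way to read this goal is as a constrained choice between candidate blocking functions: a candidate is acceptable if it misses no more than a tolerated number of true pairs, and among acceptable candidates the one covering the fewest false pairs is preferred. The sketch below assumes that reading; `evaluate_blocking`, the `epsilon` tolerance, and the toy pair sets are hypothetical names and data chosen for illustration.

```python
def evaluate_blocking(covered, true_pairs, epsilon):
    """Return (false_positives, missed_true, feasible) for one candidate.

    covered    -- set of pairs selected by a candidate blocking function
    true_pairs -- set of pairs labeled as true matches
    epsilon    -- assumed tolerance: how many true pairs may be left uncovered
    """
    false_positives = len(covered - true_pairs)
    missed_true = len(true_pairs - covered)
    return false_positives, missed_true, missed_true <= epsilon


# Toy labeled data, invented for this example.
true_pairs = {(0, 1), (2, 3), (4, 5)}
candidate_a = {(0, 1), (2, 3), (0, 2)}
candidate_b = {(0, 1), (2, 3), (4, 5), (0, 2), (1, 3), (2, 5)}

result_a = evaluate_blocking(candidate_a, true_pairs, epsilon=1)
result_b = evaluate_blocking(candidate_b, true_pairs, epsilon=1)
print(result_a, result_b)
```

With a tolerance of one missed true pair, both candidates are feasible, and `candidate_a` would be preferred because it covers fewer false pairs.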
![Page 4: Overview of Adaptive Blocking for DDL Research Lab](https://reader031.vdocuments.us/reader031/viewer/2022030304/5877265e1a28ab2b2c8b4cdb/html5/thumbnails/4.jpg)
Disjunctive and DNF blocking
• Disjunctive: select pairs covered by at least one blocking predicate
• Disjunctive Normal Form: select pairs covered by at least one conjunction of blocking predicates
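The difference between the two forms can be shown with plain boolean combinators. This is only a sketch: the pairwise predicates (`same_zip`, `same_name_prefix`, `same_year`) and the two records are made up for the example.

```python
def disjunctive(predicates):
    """Select a pair if at least one predicate covers it."""
    return lambda a, b: any(p(a, b) for p in predicates)


def conjunction(predicates):
    """Cover a pair only if every predicate in the conjunction covers it."""
    return lambda a, b: all(p(a, b) for p in predicates)


def dnf(conjunctions):
    """Select a pair if at least one conjunction of predicates covers it."""
    return disjunctive([conjunction(c) for c in conjunctions])


# Hypothetical pairwise predicates, for illustration only.
def same_zip(a, b):
    return a["zip"] == b["zip"]


def same_name_prefix(a, b):
    return a["name"][:3].lower() == b["name"][:3].lower()


def same_year(a, b):
    return a["year"] == b["year"]


rec_a = {"name": "Jon Smith", "zip": "78701", "year": 1980}
rec_b = {"name": "Jonathan Smith", "zip": "78702", "year": 1980}

disj_fn = disjunctive([same_zip, same_name_prefix])
dnf_fn = dnf([[same_zip, same_year], [same_name_prefix, same_year]])
strict_fn = dnf([[same_zip, same_name_prefix]])

print(disj_fn(rec_a, rec_b), dnf_fn(rec_a, rec_b), strict_fn(rec_a, rec_b))
```

The disjunction selects the pair because the name prefixes agree. The single conjunction in `strict_fn` rejects it because the zip codes differ, which shows how conjunctions narrow coverage.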
![Page 5: Overview of Adaptive Blocking for DDL Research Lab](https://reader031.vdocuments.us/reader031/viewer/2022030304/5877265e1a28ab2b2c8b4cdb/html5/thumbnails/5.jpg)
Disjunctive Red-Blue Set Cover
DNF Red-Blue Set Cover
![Page 6: Overview of Adaptive Blocking for DDL Research Lab](https://reader031.vdocuments.us/reader031/viewer/2022030304/5877265e1a28ab2b2c8b4cdb/html5/thumbnails/6.jpg)
Disjunctive Blocking
• Remove predicates covering too many false pairs
• Remove false pairs covered by too many predicates
• Predicate cost = number of false pairs it covers
• Weighted set cover: greedily select the predicate giving the best improvement; check whether the number of uncovered true pairs is below the threshold; repeat
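The steps above can be sketched as a small greedy loop. This is a simplified illustration under stated assumptions, not the published algorithm: the function name, the parameters `max_false`, `max_preds_per_false`, and `epsilon`, and the scoring rule (newly covered true pairs divided by cost plus one) are choices made for this example.

```python
def greedy_disjunctive_blocking(coverage, true_pairs, max_false,
                                max_preds_per_false, epsilon):
    """coverage maps a predicate name to the set of pairs it covers."""
    # Step 1: remove predicates covering too many false pairs.
    kept = {name: pairs for name, pairs in coverage.items()
            if len(pairs - true_pairs) <= max_false}

    # Step 2: ignore false pairs covered by too many remaining predicates.
    false_counts = {}
    for pairs in kept.values():
        for pair in pairs - true_pairs:
            false_counts[pair] = false_counts.get(pair, 0) + 1
    ignored = {p for p, n in false_counts.items() if n > max_preds_per_false}

    # Step 3: predicate cost = number of remaining false pairs it covers.
    cost = {name: len(pairs - true_pairs - ignored)
            for name, pairs in kept.items()}

    # Step 4: greedy selection until few enough true pairs stay uncovered.
    uncovered = set(true_pairs)
    chosen = []
    while len(uncovered) > epsilon:
        best, best_score = None, 0.0
        for name in sorted(kept):  # sorted for deterministic tie-breaking
            if name in chosen:
                continue
            gain = len(kept[name] & uncovered)
            score = gain / (cost[name] + 1)  # assumed scoring rule
            if score > best_score:
                best, best_score = name, score
        if best is None:  # no remaining predicate covers anything new
            break
        chosen.append(best)
        uncovered -= kept[best]
    return chosen, uncovered


# Toy coverage sets, invented for this example.
true_pairs = {(0, 1), (2, 3), (4, 5)}
coverage = {
    "name_prefix": {(0, 1), (2, 3), (0, 2)},
    "same_zip": {(4, 5), (1, 4)},
    "same_city": {(0, 1), (2, 3), (4, 5), (0, 2), (0, 3), (1, 2), (1, 3), (1, 4)},
}
chosen, uncovered = greedy_disjunctive_blocking(
    coverage, true_pairs, max_false=3, max_preds_per_false=1, epsilon=0)
print(chosen, uncovered)
```

In this toy run `same_city` is discarded up front because it covers five false pairs, and the remaining two predicates together cover every true pair.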
![Page 7: Overview of Adaptive Blocking for DDL Research Lab](https://reader031.vdocuments.us/reader031/viewer/2022030304/5877265e1a28ab2b2c8b4cdb/html5/thumbnails/7.jpg)
DNF Blocking
• Remove predicates covering too many pairs
• Construct predicate conjunctions, length <= k-1
• Add the conjunctions that maximize the marginal true/false ratio to the predicate set
• Apply Disjunctive Blocking with resulting predicates
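A rough sketch of the conjunction-building steps follows. It is an assumption-laden illustration, not the published procedure: `grow_conjunctions`, `ratio`, `max_pairs`, and `max_len` are hypothetical names, `max_len` stands in for the length bound given on the slide, and the ratio adds one to the denominator only to avoid division by zero.

```python
def ratio(pairs, true_pairs):
    """True/false ratio of a coverage set (the +1 is an assumed smoothing)."""
    true_n = len(pairs & true_pairs)
    false_n = len(pairs - true_pairs)
    return true_n / (false_n + 1)


def grow_conjunctions(coverage, true_pairs, max_pairs, max_len):
    """Return a mapping from predicate-name tuples to covered pairs."""
    # Step 1: remove predicates covering too many pairs.
    base = {name: pairs for name, pairs in coverage.items()
            if len(pairs) <= max_pairs}
    candidates = {(name,): pairs for name, pairs in base.items()}

    # Steps 2-3: starting from each predicate, greedily add the predicate
    # whose conjunction most improves the true/false ratio.
    for start in sorted(base):
        members, pairs = (start,), base[start]
        while len(members) < max_len:
            options = [(ratio(pairs & base[n], true_pairs), n)
                       for n in sorted(base) if n not in members]
            if not options:
                break
            best_ratio, best = max(options)
            if best_ratio <= ratio(pairs, true_pairs):
                break  # no conjunction improves the ratio
            members = members + (best,)
            pairs = pairs & base[best]
            candidates[tuple(sorted(members))] = pairs
    return candidates


# Toy coverage sets, invented for this example.
true_pairs = {(0, 1), (2, 3)}
coverage = {
    "name_prefix": {(0, 1), (2, 3), (0, 2), (1, 3)},
    "same_zip": {(0, 1), (2, 3), (0, 3)},
    "same_country": {(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)},
}
candidates = grow_conjunctions(coverage, true_pairs, max_pairs=5, max_len=2)
print(sorted(candidates))
```

The returned mapping has the same shape (name to covered pairs) that a disjunctive selection step takes as input, so the single predicates and the new conjunctions can be treated uniformly in the final step.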