probabilistic record matching and deduplication using open

22
Immunization Registry Conference Atlanta, GA October 19, 2004 Probabilistic Record Matching and Deduplication Using Open Source Software Magaly Angeloni Rhode Island Department of Health www.health.ri.gov Mike Berry HLN Consulting, LLC www.hln.com

Upload: others

Post on 14-Nov-2021

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Probabilistic Record Matching and Deduplication Using Open

Immunization Registry ConferenceAtlanta, GA

October 19, 2004

Probabilistic Record Matching and Deduplication Using Open

Source Software

Magaly AngeloniRhode Island Department of Health

www.health.ri.gov

Mike BerryHLN Consulting, LLC

www.hln.com

Page 2: Probabilistic Record Matching and Deduplication Using Open

2KIDSNETKIDSNET

Agenda• Discussion of the Problem• Deterministic/Probabilistic Matching• Matching Architecture• Software• KIDSNET Background• Records on Hold• KIDSNET Matching Results

Page 3: Probabilistic Record Matching and Deduplication Using Open

3KIDSNETKIDSNET

Discussion of the Problem• What is Data Linkage? Record Linakge?

Matching? Deduplication?• The bringing together of records of information

concerning an individual using demographic information.

• Used to support public health surveillance, epidemiological research, health services research, and program planning. (Source: CDC NCCDPHP)

Page 4: Probabilistic Record Matching and Deduplication Using Open

4KIDSNETKIDSNET

Deterministic Matching• Rule-based• Exact matches; sets of rules• Apply rules based on common errors, nicknames,

abbreviations, etc.• Experience with the data helps• Simple, efficient, fast• Successful when: high quality data or unique identifiers• Less successful when: incomplete or inaccurate data;

spelling or transcription errors; nicknames; name changes; etc.

• Sometimes components of probabilistic matching are utilized in deterministic schemes

Page 5: Probabilistic Record Matching and Deduplication Using Open

5KIDSNETKIDSNET

Probabilistic Matching• Estimate the probability that two records are

the same person vs. not the same person based on a degree of match on selected fields.

• Define a probability level above which all pairs are assumed to be the same person. Define a probability level below which all pairs are assumed not to be the same person.

• Send pairs that fall between these two levels to “human review.”

Page 6: Probabilistic Record Matching and Deduplication Using Open

6KIDSNETKIDSNET

Probabilistic Matching (cont.)• Pre-processing/Standardization• Blocking• String Comparison• Frequency Analysis• Probability Scoring, Assignment• Human Review

Not a match Human Review Definite Match

Page 7: Probabilistic Record Matching and Deduplication Using Open

7KIDSNETKIDSNET

Probabilistic Matching (cont.)Fellegi-Sunter method for computing

probabilities:

• m – probability that a field is a match given that a pair is a match

• u – probability that a field is a match given that a pair is NOT a match

Page 8: Probabilistic Record Matching and Deduplication Using Open

8KIDSNETKIDSNET

Implementing Matching

MatchingSoftware

Configuration & Testing

Architecture / Integration

Page 9: Probabilistic Record Matching and Deduplication Using Open

9KIDSNETKIDSNET

Software - Commercial• Ascential• AutoMatch• ChoiceMaker• LinkSoft• Madison• MatchWare• Search Software America (SSA)• Many others…

Page 10: Probabilistic Record Matching and Deduplication Using Open

10KIDSNETKIDSNET

Software – Non-commercial• AJAX (INRIA/NYU)• Census Bureau (Winkler, Jaro)• Febrl (ANU / NSW DOH)• GRLS (Statistics Canada)• Link Plus (CDC)• OX-Link (Oxford)• TAILOR (Purdue/Drexel)

Page 11: Probabilistic Record Matching and Deduplication Using Open

11KIDSNETKIDSNET

Febrl – Freely Extensible Biomedical Record Linkage

• Open Source / Free• Sophisticated data

standardization• Fellegi-Sunter

implementation• Fast• Many string

comparators• New features to come

• Not widely used (yet)• No user interface• Limited support for

developing parameters• Limited support for

testing• Limited integration

support• Not perfect

Pros Cons

Page 12: Probabilistic Record Matching and Deduplication Using Open

12KIDSNETKIDSNET

Implementing Febrl• Requirements solitication• Integration (batch vs. real-time)• Parameter development

– Standardization– Blocking– Probabilities– Frequencies

• Testing• Adjustments

Page 13: Probabilistic Record Matching and Deduplication Using Open

13KIDSNETKIDSNET

Testing and the CDC Toolkit• Cost• Effort to Integrate• Effort to Configure and Test• Effort to Operate• Sensitivity• Specificity• Speed

Page 14: Probabilistic Record Matching and Deduplication Using Open

14KIDSNETKIDSNET

Newborn Developmental

Risk

Newborn Blood Spot

Lead PreventionEarly

Intervention

RIHAP:Rhode Island HearingAssessment Program

Birth Defects

Immunizations

WIC

Pediatric Providers

Home Visiting

Vital Records

Page 15: Probabilistic Record Matching and Deduplication Using Open

15KIDSNETKIDSNET

KIDSNET Web Application

Page 16: Probabilistic Record Matching and Deduplication Using Open

16KIDSNETKIDSNET

Before…• 48,685 records on hold • Manual, time consuming process that took

approximately 3 minutes to resolve a record• Estimated 70 weeks or 17 months to resolve

backlog• Lack of resources• Inaccurate reporting• Limitations to use data• Little incentive to enroll new providers

Page 17: Probabilistic Record Matching and Deduplication Using Open

17KIDSNETKIDSNET

After…• >45,000 (95%) of 48,685 were removed from

the "on hold“ upon implementation of tools• ~11,000 records of children were added to KN• 78% decrease in time to resolve an error (from 3

minutes to about 40 seconds per record)• 95% of the data that comes to KIDSNET are

now imported and used for reports• Lack of resources

Page 18: Probabilistic Record Matching and Deduplication Using Open

18KIDSNETKIDSNET

History of Records on Hold

0

5000

10000

15000

20000

25000

30000

35000

Jan-0

4

Feb-04

Mar-04

Apr-04

May-04

Jun-0

4

Jul-0

4

Aug-04

Month

WIC EARLY INTERVENTION LEAD IMMUNIZATION TOTAL

Page 19: Probabilistic Record Matching and Deduplication Using Open

19KIDSNETKIDSNET

How we got there…• Funding• Leadership support• Prioritization of work• Long term plan• Mechanism to hire resources• In house resources: testing and implementation• Tedious, continuous, exhausting work

Page 20: Probabilistic Record Matching and Deduplication Using Open

20KIDSNETKIDSNET

References• RI Data Linkage Bibliography -

contact [email protected]

• PHII Deduplication Study -http://www.phii.org/reading.html

• Freely Extensible Biomedical Record Linkage (Febrl) Website -http://datamining.anu.edu.au/projects/linkage.html

• CDC Deduplication Test Kit -http://www.cdc.gov/nip/registry/dedup/dedup.htm

Page 21: Probabilistic Record Matching and Deduplication Using Open

21KIDSNETKIDSNET

Contact InformationMagaly Angeloni

Rhode Island Department of Health401-222-4602

[email protected]

Mike BerryHLN Consulting, LLC

[email protected]

Page 22: Probabilistic Record Matching and Deduplication Using Open

22KIDSNETKIDSNET

DiscussionQuestions