the mechanics of probabilistic record matching

5/21/2018 The Mechanics of Probabilistic Record Matching

1/25

THE MECHANICS OF PROBABILISTIC RECORD MATCHING

Jeffrey Tyzzer


2/25

Why Does this Deck Exist?

- I struggled while studying probabilistic matching--reading, e.g., the works of Fellegi and Sunter, Newcombe, Schumacher, and Herzog, et al.--and wanted to summarize my findings as much to help others understand it as to check my own understanding. To that end, please direct any errors and constructive feedback to me at [email protected]


3/25

Agenda

- Recall that Master Data Management (MDM) enables the consolidation and syndication of trusted, authoritative data

- In this presentation, we focus on the consolidation--or unification--of master data, which is the heart of all MDM systems


4/25

Matches

- In a data set, constructs (i.e., records) are proxies for real-world objects

- Matches are entity instances (records) that have the same values for those properties (attributes) that serve to identify them

- One of the goals of Master Data Management is to ensure that there is a 1:1 correspondence between the real and proxy objects


5/25

Ways of Matching

- There are two principal ways to match: deterministically and probabilistically

- Deterministic matching is rules-based, e.g. IF R1.a1 = R2.a1 AND R1.a2 = R2.a2 THEN Link ELSE NonLink (where R1.a1 is attribute a1 of record R1)

- Deterministic matching is binary--all or nothing

- Probabilistic matching is likelihood-based

- Probabilistic matching is analog--it's based on a range of agreement

- The pioneers of probabilistic matching were Newcombe, et al., Tepping, and Fellegi & Sunter

- Probabilistic matching is particularly useful in the absence of unique identifiers, when only so-called quasi-identifiers are available, such as names and dates of birth
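The deterministic rule above can be sketched in a few lines. This is a minimal illustration only: the function name, the attribute keys a1 and a2, and the sample records are hypothetical, and exact string equality is assumed as the comparison.

```python
def deterministic_link(r1: dict, r2: dict, attributes: list) -> bool:
    """Link only if every match attribute is exactly equal (all or nothing)."""
    return all(r1[a] == r2[a] for a in attributes)


# Hypothetical records; a1 and a2 stand in for the rule's two attributes.
r1 = {"a1": "Tyzzer", "a2": "1970-01-01"}
r2 = {"a1": "Tyzzer", "a2": "1970-01-01"}
r3 = {"a1": "Tyzer", "a2": "1970-01-01"}  # one-letter difference in a1

print(deterministic_link(r1, r2, ["a1", "a2"]))  # Link
print(deterministic_link(r1, r3, ["a1", "a2"]))  # NonLink: a single typo breaks the rule
```

The second call shows why the rule is described as binary: there is no partial credit for agreeing on one attribute out of two.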


6/25

Consider

- R1 Name: Jeff Tyzzer Address: 848 Swanston Dr. Phone: (916) 555-1212

- R2 Name: Jeffrey Tyzzer Address: 884 Swanson Dr. Phone: 555-1212

- Would you consider these two records to be matches? Why? Would they be deterministic or probabilistic matches?


7/25

Hypothesis Testing

- In classic probabilistic matching, we take our cue from inferential statistics when comparing two records probabilistically:

  - H0 - The null hypothesis: The records do not represent the same real-world object, i.e. they are not matches

  - HA - The alternate hypothesis: The records represent the same real-world object, i.e. they are matches

  - Typically, H0 is rejected if the p-value of our test statistic is less than .05 (the significance level)


8/25

Hypothesis Testing, contd

- A Type I error, whose probability is designated with the Greek letter α (alpha), occurs when we incorrectly reject H0

- A Type II error, whose probability is designated with the Greek letter β (beta), occurs when we incorrectly fail to reject H0


9/25

Record Linkage and Type I & II Errors

- Since we've decided that H0 indicates that the records are different, if we commit a Type I error (incorrectly rejecting H0) we're (wrongly) asserting that the records match. This is a false positive

- Since we've decided that HA indicates that the records are the same (matches), if we commit a Type II error (incorrectly failing to reject H0) we're (wrongly) asserting that the records do not match. This is a false negative
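To make the mapping between error types and linkage outcomes concrete, here is a small sketch that tallies both error rates from pairs whose true status is known. The function name and the sample outcomes are hypothetical, invented purely for illustration.

```python
def error_rates(outcomes):
    """Return (false_positive_rate, false_negative_rate).

    outcomes: iterable of (is_true_match, was_linked) boolean pairs.
    """
    false_pos = false_neg = matches = non_matches = 0
    for is_true_match, was_linked in outcomes:
        if is_true_match:
            matches += 1
            false_neg += not was_linked  # Type II: failed to reject H0
        else:
            non_matches += 1
            false_pos += was_linked      # Type I: wrongly rejected H0
    return false_pos / non_matches, false_neg / matches


# Hypothetical reviewed pairs: (truly the same entity?, did we link them?)
sample = [
    (True, True), (True, False), (True, True), (True, True),
    (False, False), (False, True), (False, False), (False, False),
]
fp_rate, fn_rate = error_rates(sample)
print(fp_rate, fn_rate)
```

Note that the false positive rate is measured against the true non-matches and the false negative rate against the true matches, mirroring the definitions of Type I and Type II errors.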


10/25

Agreement Probabilities

- We must first decide on our match attributes, a domain-specific decision. For this presentation, we will use First Name, Last Name, and DoB

- For our purposes, when comparing these attributes between records there are two possible outcomes: they will agree or they won't

- We calculate the probabilities of these attributes agreeing under each of the preceding hypotheses. There are several methods for computing these; among them are sampling, prior studies, and Maximum Likelihood Estimation (MLE) using Expectation-Maximization (EM)


11/25

Example

Attribute     Non-match (H0)   Match (HA)
Last Name     .05              .95
First Name    .15              .90
DoB           .25              .85

- Using one of the techniques mentioned in slide 10's last bullet point, say we find that, for our data, when the two records do in fact represent the same entity the last names match 95% of the time, the first names 90%, and the DoBs 85%. When the two records are known to represent different entities, the match rates are much lower--5%, 15%, and 25%, respectively
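Of the estimation techniques named on slide 10, sampling is the simplest to sketch: take pairs whose true status has been reviewed and count how often each attribute agrees within the matches and within the non-matches. The function name and the eight labelled pairs below are hypothetical toy data, not the deck's figures.

```python
def agreement_rates(labelled_pairs):
    """Estimate per-attribute agreement rates for matches and non-matches.

    labelled_pairs: list of (pattern, is_match), where pattern is a tuple of
    0/1 agreement flags in (LN, FN, DoB) order.
    """
    match_patterns = [p for p, is_match in labelled_pairs if is_match]
    nonmatch_patterns = [p for p, is_match in labelled_pairs if not is_match]

    def rates(patterns):
        # Share of pairs that agree at each attribute position.
        return tuple(sum(column) / len(patterns) for column in zip(*patterns))

    return rates(match_patterns), rates(nonmatch_patterns)


# Hypothetical labelled sample: four known matches, four known non-matches.
sample = [
    ((1, 1, 1), True), ((1, 1, 0), True), ((1, 0, 1), True), ((1, 1, 1), True),
    ((0, 0, 0), False), ((0, 1, 0), False), ((0, 0, 1), False), ((0, 0, 0), False),
]
m_rates, u_rates = agreement_rates(sample)
print("Match (HA):    ", m_rates)
print("Non-match (H0):", u_rates)
```

A real sample would need to be far larger, and the sketch assumes both groups are non-empty.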


12/25

Match Attribute Possibilities

- Since for simplicity's sake we're saying that the attributes must simply either match or not--designating 1 for a match and 0 for a non-match--then for our three attributes we have the following 2^3 = 8 agreement possibilities:

LN   FN   DoB
0    0    0
1    0    0
0    1    0
0    0    1
1    1    0
1    0    1
0    1    1
1    1    1
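The eight possibilities need not be listed by hand. A minimal sketch using the standard library (the variable names are illustrative, and the output order differs from the table above):

```python
from itertools import product

ATTRIBUTES = ("LN", "FN", "DoB")

# Every combination of 0 (disagree) and 1 (agree) across the attributes.
patterns = list(product((0, 1), repeat=len(ATTRIBUTES)))

for pattern in patterns:
    print(pattern)
```

Adding a fourth match attribute would double the count to 16, which is why the number of patterns grows quickly in realistic settings.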


13/25

Match Attribute Probabilities

- The space of all possible agreement patterns is referred to by the Greek letter Γ (gamma)

- Given the agreement probabilities listed on slide 11, we next compute two probabilities for each of the eight agreement patterns (slide 12) in Γ (in the same attribute order): the m (match) probability and the u (non-match) probability

- Example - the m probability for the (0,0,0) pattern (i.e. none match): (1 - .95) * (1 - .90) * (1 - .85) = 0.00075

- Example - the u probability for the (1,0,1) pattern (match on LN and DoB): (.05) * (1 - .15) * (.25) = 0.01063

- The agreement pattern is viewed as a discrete random variable whose possible values are the comparison outcomes


14/25

Match Attribute Probabilities, contd

- The completed table looks like this:

Agreement Pattern   m        u
0,0,0               .00075   .60563
1,0,0               .01425   .03188
0,1,0               .00675   .10688
0,0,1               .00425   .20188
1,1,0               .12825   .00563
1,0,1               .08075   .01063
0,1,1               .03825   .03563
1,1,1               .72675   .00188
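The table can be reproduced from the slide 11 probabilities with a short sketch. It multiplies, for each attribute, the agreement probability when the flag is 1 and its complement when the flag is 0, which relies on the attributes being treated as independent. The function and variable names are illustrative.

```python
from itertools import product
from math import prod

# Agreement probabilities from slide 11, in (LN, FN, DoB) order.
M = (0.95, 0.90, 0.85)  # given the records match (HA)
U = (0.05, 0.15, 0.25)  # given the records do not match (H0)


def pattern_probability(pattern, agree_probs):
    """Probability of seeing this 0/1 agreement pattern, attribute by attribute."""
    return prod(p if agrees else 1 - p for agrees, p in zip(pattern, agree_probs))


table = {
    pattern: (pattern_probability(pattern, M), pattern_probability(pattern, U))
    for pattern in product((0, 1), repeat=3)
}

for pattern, (m, u) in table.items():
    print(pattern, f"{m:.5f}", f"{u:.5f}")
```

Because the eight patterns cover every possible outcome, the m values and the u values each add up to 1.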


15/25

Observations

- Given the agreement probabilities on slide 11, only 72.675% of the true matches would have matched deterministically, and only 60.563% of those records that don't match would have disagreed on all three attributes

- Both columns of the table on slide 14 (must) sum to 1

- Probabilistic matching gives us "maybe" in addition to "yes" and "no" as a possible outcome--it lets us deal with those situations where not all attributes match, but some do (recall your answers to the questions on slide 6)

- This technique assumes conditional independence among the match attributes, which may not always be the case (consider the correlation between name and gender)


16/25

Almost There

- The next two steps are:

  - Calculate the log-likelihood ratio test statistic T, the base-2 logarithm of the ratio of m and u, e.g., T = log2(0.03825/0.03563) = 0.10237, and order the results ascending by T

  - Sum the cumulative probabilities (m top down, u bottom up)


17/25

The Test Statistic & Cumulative Probs

Agreement Pattern   m         u         T          Cumulative m   Cumulative u
0,0,0               0.00075   0.60563   -9.65733   0.00075        1.00000
0,0,1               0.00425   0.20188   -5.56989   0.00500        0.39441
0,1,0               0.00675   0.10688   -3.98496   0.01175        0.19253
1,0,0               0.01425   0.03188   -1.16169   0.02600        0.08565
0,1,1               0.03825   0.03563    0.10237   0.06425        0.05377
1,0,1               0.08075   0.01063    2.92532   0.14500        0.01814
1,1,0               0.12825   0.00563    4.50968   0.27325        0.00751
1,1,1               0.72675   0.00188    8.59458   1.00000        0.00188
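The two steps from slide 16 can be sketched as follows, starting again from the slide 11 probabilities. The names are illustrative. The T values in the table appear to have been computed from the rounded m and u, so the unrounded results here can differ slightly in the later decimal places.

```python
from itertools import accumulate, product
from math import log2, prod

M = (0.95, 0.90, 0.85)  # agreement probabilities for matches (slide 11)
U = (0.05, 0.15, 0.25)  # agreement probabilities for non-matches (slide 11)


def pattern_probability(pattern, agree_probs):
    return prod(p if agrees else 1 - p for agrees, p in zip(pattern, agree_probs))


# Step 1: compute T = log2(m / u) for each pattern and sort ascending by T.
rows = []
for pattern in product((0, 1), repeat=3):
    m = pattern_probability(pattern, M)
    u = pattern_probability(pattern, U)
    rows.append((pattern, m, u, log2(m / u)))
rows.sort(key=lambda row: row[3])

# Step 2: cumulative m from the top down, cumulative u from the bottom up.
cum_m = list(accumulate(row[1] for row in rows))
cum_u = list(accumulate(row[2] for row in reversed(rows)))[::-1]

for (pattern, m, u, t), cm, cu in zip(rows, cum_m, cum_u):
    print(pattern, f"{m:.5f} {u:.5f} {t:9.5f} {cm:.5f} {cu:.5f}")
```

Reversing the rows before accumulating, then reversing the result, keeps the bottom-up sums aligned with the ascending T order.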


18/25

Deciding on the Thresholds

- We have three choices when confronted with a pair of records: definitely link them, definitely do not link them, and maybe link them. How do we decide? By establishing two thresholds, which divide the T values into three discrete (and disjoint) regions (slide 17)

- If, as on slide 7, we reject H0 at the .05 level, then we've decided that we're willing to accept an alpha of .05, meaning that we're OK with a Type I error (a false positive, given our definitions of H0 and HA) 5% of the time. In other words, we're willing to accept that up to 5% of the record pairs that are truly non-matches could be linked erroneously

- Assume that beta, our tolerance for a Type II error (a false negative, given our definitions of H0 and HA) is also .05. (Note that the false positive and negative thresholds are domain-specific--what's the possible harm of a false positive in a hospital setting versus one for, say, a direct marketer compiling a household address list?)


19/25

Deciding on the Thresholds, contd

- The cumulative sum of the m probabilities (top down) represents our false negative rate and the cumulative sum of the u probabilities (bottom up) is our false positive rate. The last two columns in the table on slide 17, respectively, show these

- Our settings of alpha and beta dictate that any pair of records with a T of -1.16169 (λ) or less is a definite non-link (cumulative m there is .02600, the largest value not exceeding beta) and that any pair of records with a T of 2.92532 (μ) or greater is a definite link (cumulative u there is .01814, the largest value not exceeding alpha). Thus, those with an agreement pattern of (0,1,1) are our maybes. This is known as the clerical review region
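The threshold choice can be sketched directly from the slide 17 table. This is a simplified reading of the procedure, assuming thresholds fall exactly on observed T values; the function name is illustrative, and the (T, m, u) figures are copied from slide 17.

```python
# (T, m, u) rows from the slide 17 table, already sorted ascending by T.
ROWS = [
    (-9.65733, 0.00075, 0.60563),
    (-5.56989, 0.00425, 0.20188),
    (-3.98496, 0.00675, 0.10688),
    (-1.16169, 0.01425, 0.03188),
    (0.10237, 0.03825, 0.03563),
    (2.92532, 0.08075, 0.01063),
    (4.50968, 0.12825, 0.00563),
    (8.59458, 0.72675, 0.00188),
]


def thresholds(rows, alpha, beta):
    """Return (lower, upper) T thresholds; None if no pattern qualifies."""
    lower = upper = None
    cum_m = 0.0
    for t, m, _ in rows:                # top down: matches we would miss
        cum_m += m
        if cum_m > beta:
            break
        lower = t
    cum_u = 0.0
    for t, _, u in reversed(rows):      # bottom up: non-matches we would link
        cum_u += u
        if cum_u > alpha:
            break
        upper = t
    return lower, upper


lower, upper = thresholds(ROWS, alpha=0.05, beta=0.05)
print(lower, upper)
```

Tightening alpha or beta moves the thresholds apart and widens the clerical review region; loosening them narrows it.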


20/25

A Graphical Representation

[Figure: chart with the eight T values from slide 17 (-9.65733 through 8.59458) on the horizontal axis and a vertical axis running from 0 to 1.2.]


21/25

Interpretation

- Record pairs to the left of the red line (lambda) are a certain "no" and those to the right of the green line (mu) are a certain "yes." In between the two lines is the "maybe" region, whose record pairs require human review

- Fellegi & Sunter's technique assures us that the "maybe" region is as small as possible given our settings for alpha and beta (ref. the Neyman-Pearson lemma)

- The width of the clerical region is a function of the values of α and β (slide 8)


22/25

Example I

Record   LN       FN     DoB
1        Tyzzer   John   5/26/19xx
2        Tyzzer   Jeff   5/26/19xx

- The agreement pattern is (1,0,1). Given its corresponding T value (2.92532, which is at the upper threshold), these records would be classified as a match


23/25

Example II

Record   LN       FN     DoB
1        Smith    Jeff   5/26/19xx
2        Tyzzer   Jeff   5/26/19xx

- The agreement pattern is (0,1,1). Given its corresponding T value (0.10237, which lies between the two thresholds), these records would be classified as a "maybe" and queued for clerical review
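Both examples can be run through a small classifier built from the slide 17 T values and the slide 19 thresholds. The function name, dictionary keys, and exact-equality comparison are assumptions made for this sketch; the record values are those shown on slides 22 and 23.

```python
# T value for each (LN, FN, DoB) agreement pattern, from slide 17.
T_BY_PATTERN = {
    (0, 0, 0): -9.65733, (0, 0, 1): -5.56989, (0, 1, 0): -3.98496,
    (1, 0, 0): -1.16169, (0, 1, 1): 0.10237, (1, 0, 1): 2.92532,
    (1, 1, 0): 4.50968, (1, 1, 1): 8.59458,
}
LOWER, UPPER = -1.16169, 2.92532  # thresholds from slide 19


def classify(r1, r2):
    """Return (agreement pattern, decision) for a pair of records."""
    pattern = tuple(int(r1[key] == r2[key]) for key in ("ln", "fn", "dob"))
    t = T_BY_PATTERN[pattern]
    if t <= LOWER:
        return pattern, "non-link"
    if t >= UPPER:
        return pattern, "link"
    return pattern, "clerical review"


example_1 = ({"ln": "Tyzzer", "fn": "John", "dob": "5/26/19xx"},
             {"ln": "Tyzzer", "fn": "Jeff", "dob": "5/26/19xx"})
example_2 = ({"ln": "Smith", "fn": "Jeff", "dob": "5/26/19xx"},
             {"ln": "Tyzzer", "fn": "Jeff", "dob": "5/26/19xx"})

print(classify(*example_1))
print(classify(*example_2))
```

Both pairs agree on two of three attributes, yet they land in different regions because agreement on last name carries more weight than agreement on first name.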


24/25

Some Final Thoughts

- To compute the agreement probabilities (slide 11), the expectation-maximization (EM) technique is usually employed. These probabilities drive all subsequent results

- The demonstrated scenario and examples are deliberately trivial

- A more realistic situation would likely include more match columns and several more possible configurations of them instead of simple agreement or disagreement

- A more realistic situation would also have accommodated fuzzy matches and incorporated value-specific frequencies into the probability calculations. For last name, say, the agreement pattern would then be interpreted as "the LN agrees and is a particular value," e.g. "Smith"

- To reduce the number of record-to-record comparisons from n(n-1)/2 (intrafile) or n*m (interfile) to something manageable, blocking (e.g. on zip code or the phonetic encoding of the surname) is typically used
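Blocking, from the last bullet above, can be sketched as grouping records by a key and comparing only within each group. The function name, the zip-code key, and the six records are hypothetical and exist only to show the reduction in pair count.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Yield only the record pairs that share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)


# Hypothetical intrafile data set: six records spread over three zip codes.
records = [
    {"id": 1, "zip": "95814"}, {"id": 2, "zip": "95814"},
    {"id": 3, "zip": "95630"}, {"id": 4, "zip": "95630"},
    {"id": 5, "zip": "95661"}, {"id": 6, "zip": "95661"},
]

n = len(records)
all_pairs = n * (n - 1) // 2
blocked_pairs = list(candidate_pairs(records, lambda r: r["zip"]))

print(all_pairs, "pairs without blocking")
print(len(blocked_pairs), "pairs with blocking on zip")
```

The trade-off is that true matches whose blocking keys disagree (for example, a mistyped zip code) are never compared at all.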


25/25

References

- Do, Chuong B., and Serafim Batzoglou. "What is the Expectation Maximization Algorithm?" Nature Biotechnology 26.8 (2008): 897-9.

- Fellegi, Ivan, and Alan B. Sunter. "A Theory for Record Linkage." Journal of the American Statistical Association 64.328 (1969): 1183-1210.

- Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. Data Quality and Record Linkage Techniques. New York: Springer Science + Business Media, 2007.

- ---. "Record Linkage." WIREs Computational Statistics 2.5 (2010): 535-543.

- Kirkendall, Nancy. "Weights in Computer Matching: Applications and an Information Theoretic Point of View." Record Linkage Techniques--1985. Internal Revenue Service.
