The Mechanics of Probabilistic Record Matching

Upload: jtyzzer

Post on 11-Oct-2015


DESCRIPTION

A slide-based overview of the basics of classic probabilistic record matching

TRANSCRIPT

  • 5/21/2018 The Mechanics of Probabilistic Record Matching

    1/25

    THE MECHANICS OF PROBABILISTIC RECORD MATCHING

    Jeffrey Tyzzer


    2/25

    Why Does this Deck Exist?

    - I struggled while studying probabilistic matching (reading, e.g., the
      works of Fellegi and Sunter, Newcombe, Schumacher, and Herzog, et al.)
      and wanted to summarize my findings as much to help others understand
      it as to check my own understanding. To that end, please direct any
      errors and constructive feedback to me at

    [email protected]


    3/25

    Agenda

    - Recall that Master Data Management (MDM) enables the consolidation and
      syndication of trusted, authoritative data
    - In this presentation, we focus on the consolidation (or unification) of
      master data, which is the heart of all MDM systems


    4/25

    Matches

    - In a data set, constructs (i.e., records) are proxies for real-world
      objects
    - Matches are entity instances (records) that have the same values for
      those properties (attributes) that serve to identify them
    - One of the goals of Master Data Management is to ensure that there is a
      1:1 correspondence between the real and proxy objects


    5/25

    Ways of Matching

    - There are two principal ways to match: deterministically and
      probabilistically
    - Deterministic matching is rules-based, e.g., IF R1a1 = R2a1 AND
      R1a2 = R2a2 THEN Link ELSE NonLink (where R1a1 is attribute a1 of
      record R1)
    - Deterministic matching is binary: all or nothing
    - Probabilistic matching is likelihood-based
    - Probabilistic matching is analog: it's based on a range of agreement
    - The pioneers of probabilistic matching were Newcombe, et al., Tepping,
      and Fellegi & Sunter
    - Probabilistic matching is particularly useful in the absence of unique
      identifiers, when only so-called quasi-identifiers are available, such
      as names and dates of birth
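To make the deterministic rule concrete, here is a minimal Python sketch. The function name `deterministic_link` and the attribute names `a1` and `a2` are illustrative assumptions, not part of the original deck; the logic is simply "every listed attribute must be equal."

```python
def deterministic_link(r1: dict, r2: dict, attributes: list) -> bool:
    """Return True (Link) only if every listed attribute is equal, else False (NonLink)."""
    return all(r1[a] == r2[a] for a in attributes)


# Hypothetical records with two match attributes, a1 and a2
rec1 = {"a1": "Tyzzer", "a2": "Jeff"}
rec2 = {"a1": "Tyzzer", "a2": "Jeffrey"}

both = deterministic_link(rec1, rec2, ["a1", "a2"])  # a2 differs, so NonLink
only_a1 = deterministic_link(rec1, rec2, ["a1"])     # a1 agrees, so Link
print(both, only_a1)
```

A single differing attribute flips the whole decision, which is the "all or nothing" behavior described above.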


    6/25

    Consider

    - R1  Name: Jeff Tyzzer     Address: 848 Swanston Dr.  Phone: (916) 555-1212
    - R2  Name: Jeffrey Tyzzer  Address: 884 Swanson Dr.   Phone: 555-1212
    - Would you consider these two records to be matches? Why? Would they be
      deterministic or probabilistic matches?


    7/25

    Hypothesis Testing

    - In classic probabilistic matching, we take our cue from inferential
      statistics when comparing two records probabilistically:
        - H0, the null hypothesis: The records do not represent the same
          real-world object, i.e., they are not matches
        - HA, the alternate hypothesis: The records represent the same
          real-world object, i.e., they are matches
    - Typically, H0 is rejected if the p-value of our test statistic is less
      than .05 (the chosen significance level)


    8/25

    Hypothesis Testing, cont'd

    - A Type I error, whose probability is designated with the Greek letter
      α (alpha), occurs when we incorrectly reject H0
    - A Type II error, whose probability is designated with the Greek letter
      β (beta), occurs when we incorrectly fail to reject H0


    9/25

    Record Linkage and Type I & II Errors

    - Since we've decided that H0 indicates that the records are different,
      if we commit a Type I error (incorrectly rejecting H0) we're (wrongly)
      asserting that the records match. This is a false positive
    - Since we've decided that HA indicates that the records are the same
      (matches), if we commit a Type II error (incorrectly failing to reject
      H0) we're (wrongly) asserting that the records do not match. This is a
      false negative


    10/25

    Agreement Probabilities

    - We must first decide on our match attributes, a domain-specific
      decision. For this presentation, we will use First Name, Last Name,
      and DoB
    - For our purposes, when comparing these attributes between records
      there are two possible outcomes: they will agree or they won't
    - We calculate the probabilities of these attributes agreeing under each
      of the preceding hypotheses. There are several methods for computing
      these; among them are sampling, prior studies, and Maximum Likelihood
      Estimation (MLE) using Expectation Maximization (EM)


    11/25

    Example

    Attribute     Non-match (H0)   Match (HA)
    Last Name     .05              .95
    First Name    .15              .90
    DoB           .25              .85

    - Using one of the techniques mentioned in slide 10's last bullet point,
      say we find that, for our data, when the two records do in fact
      represent the same entity the last names match 95% of the time, the
      first names 90%, and the DoBs 85%. When the two records are known to
      represent different entities, the match rates are much lower: 5%, 15%,
      and 25%, respectively


    12/25

    Match Attribute Possibilities

    - Since for simplicity's sake we're saying that the attributes must
      simply either match or not (designating 1 for a match and 0 for a
      non-match), then for our three attributes we have the following
      2^3 = 8 agreement possibilities:

    LN   FN   DoB
    0    0    0
    1    0    0
    0    1    0
    0    0    1
    1    1    0
    1    0    1
    0    1    1
    1    1    1
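The eight patterns can also be generated mechanically. The sketch below uses Python's `itertools.product`; the constant name `ATTRIBUTES` is an assumption chosen for illustration, and the enumeration order differs from the order in the listing above.

```python
from itertools import product

ATTRIBUTES = ("LN", "FN", "DoB")  # same attribute order as the slide

# Every combination of 0 (disagree) and 1 (agree) across the three attributes
patterns = list(product((0, 1), repeat=len(ATTRIBUTES)))

print(len(patterns))
for pattern in patterns:
    print(pattern)
```

With k binary match attributes the list has 2^k entries, so each added attribute doubles the number of patterns.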


    13/25

    Match Attribute Probabilities

    - The space of all possible agreement patterns is referred to by the
      Greek letter Γ (gamma)
    - Given the agreement probabilities listed on slide 11, we next compute
      two probabilities for each of the eight agreement patterns (slide 12)
      in Γ (in the same attribute order): the m (match) probability and the
      u (non-match) probability
    - Example: the m probability for the (0,0,0) pattern (i.e., none match):
        (1 - .95) * (1 - .90) * (1 - .85) = 0.00075
    - Example: the u probability for the (1,0,1) pattern (match on LN and
      DoB):
        (.05) * (1 - .15) * (.25) = 0.01063
    - The agreement pattern is viewed as a discrete random variable
      representing the set of all possible comparison outcomes
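The two worked examples follow one rule: for each attribute, multiply by the agreement probability if the pattern has a 1 and by its complement if it has a 0. A minimal sketch, using the slide 11 probabilities and assuming conditional independence (the names `M`, `U`, and `pattern_probability` are illustrative):

```python
# Agreement probabilities from slide 11, in (LN, FN, DoB) order
M = (0.95, 0.90, 0.85)  # P(attribute agrees | records match)
U = (0.05, 0.15, 0.25)  # P(attribute agrees | records do not match)


def pattern_probability(pattern, agree_probs):
    """Probability of an agreement pattern, assuming independent attributes."""
    prob = 1.0
    for agrees, p in zip(pattern, agree_probs):
        prob *= p if agrees else (1.0 - p)
    return prob


m_000 = pattern_probability((0, 0, 0), M)  # approximately 0.00075
u_101 = pattern_probability((1, 0, 1), U)  # approximately 0.010625
print(m_000, u_101)
```

The second value is 0.010625 before rounding, which the slide reports as 0.01063.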


    14/25

    Match Attribute Probabilities, cont'd

    - The completed table looks like this:

    Agreement Pattern   m        u
    0,0,0               .00075   .60563
    1,0,0               .01425   .03188
    0,1,0               .00675   .10688
    0,0,1               .00425   .20188
    1,1,0               .12825   .00563
    1,0,1               .08075   .01063
    0,1,1               .03825   .03563
    1,1,1               .72675   .00188


    15/25

    Observations

    - Given the agreement probabilities on slide 11, only 72.675% of the
      true matches would have matched deterministically, and only 60.563%
      of those records that don't match would have disagreed on all three
      attributes
    - Both columns (must) sum to 1
    - Probabilistic matching gives us "maybe" in addition to "yes" and "no"
      as a possible outcome: it lets us deal with those situations where not
      all attributes match, but some do (recall your answers to the
      questions on slide 6)
    - This technique assumes conditional independence among the match
      attributes, which may not always be the case (consider the
      correlation between name and gender)


    16/25

    Almost There

    - The next two steps are:
        - Calculate the log-likelihood ratio test statistic T, the base-2
          logarithm of the ratio of m to u, e.g.,
          T = log2(0.03825/0.03563) = 0.10237,
          and order the results ascending by T
        - Sum the cumulative probabilities (m top down, u bottom up)
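Both steps can be sketched in a few lines of Python. This is an illustration under the slide 11 probabilities, not the author's own code; the names `rows`, `cum_m`, and `cum_u` are assumptions. It works from unrounded m and u values, so T can differ from the slide's figures in the third or fourth decimal place (the slide divides the rounded values).

```python
from itertools import product
from math import log2

M = (0.95, 0.90, 0.85)  # agreement probabilities for matches (LN, FN, DoB)
U = (0.05, 0.15, 0.25)  # agreement probabilities for non-matches


def pattern_probability(pattern, agree_probs):
    """Probability of an agreement pattern, assuming independent attributes."""
    prob = 1.0
    for agrees, p in zip(pattern, agree_probs):
        prob *= p if agrees else (1.0 - p)
    return prob


# Step 1: compute T = log2(m / u) per pattern and sort ascending by T
rows = []
for pattern in product((0, 1), repeat=3):
    m = pattern_probability(pattern, M)
    u = pattern_probability(pattern, U)
    rows.append({"pattern": pattern, "m": m, "u": u, "T": log2(m / u)})
rows.sort(key=lambda row: row["T"])

# Step 2: accumulate m from the top down and u from the bottom up
running = 0.0
for row in rows:
    running += row["m"]
    row["cum_m"] = running
running = 0.0
for row in reversed(rows):
    running += row["u"]
    row["cum_u"] = running

for row in rows:
    print(row["pattern"], f'{row["T"]:.5f}', f'{row["cum_m"]:.5f}', f'{row["cum_u"]:.5f}')
```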


    17/25

    The Test Statistic & Cumulative Probs

    Agreement Pattern   m         u         T          Cumulative m   Cumulative u
    0,0,0               0.00075   0.60563   -9.65733   0.00075        1.00000
    0,0,1               0.00425   0.20188   -5.56989   0.00500        0.39441
    0,1,0               0.00675   0.10688   -3.98496   0.01175        0.19253
    1,0,0               0.01425   0.03188   -1.16169   0.02600        0.08565
    0,1,1               0.03825   0.03563    0.10237   0.06425        0.05377
    1,0,1               0.08075   0.01063    2.92532   0.14500        0.01814
    1,1,0               0.12825   0.00563    4.50968   0.27325        0.00751
    1,1,1               0.72675   0.00188    8.59458   1.00000        0.00188


    18/25

    Deciding on the Thresholds

    - We have three choices when confronted with a pair of records:
      definitely link them, definitely do not link them, and maybe link
      them. How do we decide? By establishing thresholds for each of the
      three possibilities, resulting in three discrete (and disjoint) T
      regions (slide 17)
    - If, as we said on slide 7, we reject H0 at the .05 level, then we've
      decided that we're willing to accept an alpha of .05, meaning that
      we're OK with a Type I error (a false positive, given our definitions
      of H0 and HA) 5% of the time. In other words, we're willing to accept
      that up to 5% of the record pairs that are truly non-matches could be
      linked erroneously
    - Assume that beta, our tolerance for a Type II error (a false negative,
      given our definitions of H0 and HA), is also .05. (Note that the false
      positive and negative thresholds are domain-specific: what's the
      possible harm of a false positive in a hospital setting versus one
      for, say, a direct marketer compiling a household address list?)


    19/25

    Deciding on the Thresholds, cont'd

    - The cumulative sum of the m probabilities represents our false
      negative rate (the share of true matches that fall at or below a
      given T) and the cumulative sum of the u probabilities is our false
      positive rate (the share of true non-matches that fall at or above a
      given T). The last two columns in the table on slide 17,
      respectively, show these
    - Our settings of alpha and beta dictate that any pair of records with
      a T of -1.16169 (λ) or less is a definite non-link and that any pair
      of records with a T of 2.92532 (μ) or greater is a definite link.
      Thus, those with an agreement pattern of (0,1,1) are our maybes. This
      is known as the clerical review region
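The threshold selection can be sketched as a lookup over the slide 17 table. This sketch assumes the cumulative m column is compared against beta and the cumulative u column against alpha; with alpha = beta = .05, as here, the result is the same either way. The names `TABLE` and `pick_thresholds` are illustrative.

```python
# Rows copied from the slide 17 table: (pattern, T, cumulative m, cumulative u)
TABLE = [
    ((0, 0, 0), -9.65733, 0.00075, 1.00000),
    ((0, 0, 1), -5.56989, 0.00500, 0.39441),
    ((0, 1, 0), -3.98496, 0.01175, 0.19253),
    ((1, 0, 0), -1.16169, 0.02600, 0.08565),
    ((0, 1, 1), 0.10237, 0.06425, 0.05377),
    ((1, 0, 1), 2.92532, 0.14500, 0.01814),
    ((1, 1, 0), 4.50968, 0.27325, 0.00751),
    ((1, 1, 1), 8.59458, 1.00000, 0.00188),
]


def pick_thresholds(table, alpha, beta):
    """Return (lower, upper) T thresholds; None if no row satisfies a bound."""
    # Highest T whose cumulative m still stays within beta: definite non-link cutoff
    lower = max((t for _, t, cum_m, _ in table if cum_m <= beta), default=None)
    # Lowest T whose cumulative u still stays within alpha: definite link cutoff
    upper = min((t for _, t, _, cum_u in table if cum_u <= alpha), default=None)
    return lower, upper


lower, upper = pick_thresholds(TABLE, alpha=0.05, beta=0.05)
review = [pattern for pattern, t, _, _ in TABLE if lower < t < upper]
print(lower, upper, review)
```

Loosening alpha or beta moves the two cutoffs toward each other and shrinks the review list; tightening them widens it.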


    20/25

    A Graphical Representation

    [Figure: chart plotted over the eight T values from slide 17 (horizontal
    axis, -9.65733 to 8.59458) with a vertical axis running from 0 to 1.2]


    21/25

    Interpretation

    - Record pairs to the left of the red line (lambda) are a certain "no"
      and those to the right of the green line (mu) are a certain "yes". In
      between the two lines is the "maybe" region, whose record pairs
      require human review
    - Fellegi & Sunter's technique assures us that the "maybe" region is as
      small as possible given our settings for alpha and beta (ref. the
      Neyman-Pearson lemma)
    - The width of the clerical region is a function of the values of α and
      β (slide 8)


    22/25

    Example I

    Record   LN       FN     DoB
    1        Tyzzer   John   5/26/19xx
    2        Tyzzer   Jeff   5/26/19xx

    - The agreement pattern is (1,0,1). Given its corresponding T value,
      these records would be classified as a match


    23/25

    Example II

    Record   LN       FN     DoB
    1        Smith    Jeff   5/26/19xx
    2        Tyzzer   Jeff   5/26/19xx

    - The agreement pattern is (0,1,1). Given its corresponding T value,
      these records would be classified as a "maybe" and queued for
      clerical review
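Examples I and II can be reproduced with a small classifier. This is a sketch assuming exact string equality per attribute, the T values from slide 17, and the thresholds from slide 19; the function and constant names are illustrative.

```python
# T values by agreement pattern (LN, FN, DoB), copied from slide 17
T_BY_PATTERN = {
    (0, 0, 0): -9.65733, (0, 0, 1): -5.56989, (0, 1, 0): -3.98496,
    (1, 0, 0): -1.16169, (0, 1, 1): 0.10237, (1, 0, 1): 2.92532,
    (1, 1, 0): 4.50968, (1, 1, 1): 8.59458,
}
LOWER, UPPER = -1.16169, 2.92532  # thresholds from slide 19


def agreement_pattern(r1, r2, attributes=("LN", "FN", "DoB")):
    """1 where the attribute values are exactly equal, 0 otherwise."""
    return tuple(int(r1[a] == r2[a]) for a in attributes)


def classify(r1, r2):
    t = T_BY_PATTERN[agreement_pattern(r1, r2)]
    if t >= UPPER:
        return "link"
    if t <= LOWER:
        return "non-link"
    return "clerical review"


john = {"LN": "Tyzzer", "FN": "John", "DoB": "5/26/19xx"}
jeff = {"LN": "Tyzzer", "FN": "Jeff", "DoB": "5/26/19xx"}
smith = {"LN": "Smith", "FN": "Jeff", "DoB": "5/26/19xx"}

example_1 = classify(john, jeff)   # pattern (1,0,1)
example_2 = classify(smith, jeff)  # pattern (0,1,1)
print(example_1, example_2)
```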


    24/25

    Some Final Thoughts

    - To compute the agreement probabilities (slide 11), the expectation
      maximization (EM) technique is usually employed. These probabilities
      drive all subsequent results
    - The demonstrated scenario and examples are deliberately trivial
    - A more realistic situation would likely include more match columns
      and several more possible configurations of them instead of simple
      agreement or disagreement
    - A more realistic situation would also have accommodated fuzzy matches
      and incorporated value-specific frequencies into the probability
      calculations. For last name, say, the agreement pattern would then be
      interpreted as "the LN agrees and is a particular value, e.g., Smith"
    - To reduce the number of record-to-record comparisons from n(n-1)/2
      (intrafile) or n*m (interfile) to something manageable, blocking
      (e.g., on zip code or the phonetic encoding of the surname) is
      typically used
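The effect of blocking on the comparison count can be illustrated with a short sketch. The records, the `zip` field, and the function name `candidate_pairs` are hypothetical; the point is only that pairs are formed within a block rather than across the whole file.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Yield only the record pairs that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)


# Hypothetical intrafile data set with a zip code to block on
records = [
    {"id": 1, "zip": "95814"},
    {"id": 2, "zip": "95814"},
    {"id": 3, "zip": "95616"},
    {"id": 4, "zip": "95616"},
    {"id": 5, "zip": "94105"},
]

n = len(records)
all_pairs = n * (n - 1) // 2  # every record against every other record
blocked_pairs = list(candidate_pairs(records, lambda r: r["zip"]))
print(all_pairs, len(blocked_pairs))
```

The trade-off is that true matches whose blocking key disagrees (for example, a mistyped zip code) are never compared at all.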


    25/25

    References

    - Do, Chuong B., and Serafim Batzoglou. "What is the Expectation
      Maximization Algorithm?" Nature Biotechnology 26.8 (2008): 897-9.
    - Fellegi, Ivan, and Alan B. Sunter. "A Theory for Record Linkage."
      Journal of the American Statistical Association 64.328 (1969):
      1183-1210.
    - Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. Data
      Quality and Record Linkage Techniques. New York: Springer Science +
      Business Media, 2007.
    - ---. "Record Linkage." WIREs Computational Statistics 2.5 (2010):
      535-543.
    - Kirkendall, Nancy. "Weights in Computer Matching: Applications and an
      Information Theoretic Point of View." Record Linkage Techniques--1985.
      Internal Revenue Service.
