mortality data in healthcare analytics€¦ · file accounted for a mere 17% of the cdc-reported...

Mortality Data in Healthcare AnalyticsSourcing Robust Data In a HIPAA-Compliant Manner

Executive Summary

The incorporation of mortality data into healthcare data sets allows fraud prevention, accurate billing and benefits distribution, and true outcome analysis (especially in fatal disease areas like oncology, where survival is a key endpoint). The net effect of adding mortality data is strengthening identity protection, reducing healthcare costs, and improving health treatments and care delivery. However, most healthcare data streams do not inherently capture mortality, and changes to the Social Security Administration’s (SSA) Death Master File (DMF), a leading public data source for mortality records, have dramatically reduced this file’s coverage from 2.5 million lives in 2010 to 460,000 lives in 2016. This study discusses options for filling the gaps in DMF data coverage, and presents a solution for joining this mortality data to healthcare data sets in a Health Insurance Portability and Accountability Act of 1996 (HIPAA)-compliant manner.

Key findings include:• Combining obituary data with the DMF data file is a robust method to increase U.S. mortality data

coverage; in 2016, obituary data adds 1.7 million unique records not included in the DMF data file.• Adding mortality data directly into a healthcare data environment could be considered a HIPAA

violation, as mortality data contains numerous fields that are personal health information (PHI) and personally identifiable information (PII), including names, dates of birth, and even social security numbers (from the DMF).

• Mortality data should be added to healthcare data environments only after performing HIPAA-compliant de-identification, and joining of mortality data to healthcare data should only be performed using encrypted patient tokens to match records, not by using the PHI or PII present in mortality data.

The Value of Mortality Data in Healthcare Analyses

The value of mortality data in healthcare analysis has long been recognized; over three decades ago, a United Nations report noted “the sector in which mortality information is most directly valuable is that of health.”1 Mortality data can be used to identify high-risk patient groups to support the effective reallocation of resources to improve health and health outcomes. Validating patient mortality helps protect patient identity, limits fraud, and ensures accurate billing processes. And in clinical development, measuring mortality provides a vital endpoint for new therapy approval, and the objective data for accurate outcome analysis. The net result for physicians, insurers, drug developers, and health policy makers is improved care, reduced costs and ultimately, more lives saved.

© Datavant www.datavant.com

Despite the value mortality data can bring to healthcare analyses, healthcare data sets often lack mortality data. Healthcare data are gathered as a result of an interaction with a site of care, and deceased patients no longer interact with points of care. For example, healthcare data may be collected in the form of insurance claims following a provider visit or laboratory test, as provider notes entered into an Electronic Medical Record (EMR), or as a prescription filled at a pharmacy. Unless a patient dies during a care interaction (e.g. in the hospital), a death will not be recorded in the healthcare dataset.

Mortality Sources: The Social Security Death Master File

Historically, healthcare users have filled the gaps in mortality data by incorporating death records from the SSA DMF. The DMF is an index of over 85 million records based on SSA payment records and death reports from family members, funeral homes, financial institutions, postal authorities, states, and other federal agencies, from 1936 to the present.2,3,4 Each DMF record typically includes the social security number, full name, date of birth, and date of death. The file is updated weekly, making it a highly valuable and timely record of U.S. deaths.

While the SSA does not guarantee data completeness or accuracy, the DMF has become an important data source for death verification and fraud protection. Medical researchers rely on the information to track study patients and verify death, while members of financial institutions, insurance companies, and governments rely on the information to verify identity and prevent fraud. In addition, the DMF is a more affordable, easier to use, and timelier source than the National Death Index (NDI) maintained by the Centers for Disease Control (CDC), despite the latter being a more complete record of U.S. deaths.

The DMF is considered a public file under the Freedom of Information Act (FOIA) and has been made available since 1980; however, access to the dataset is restricted, and is contingent upon both an application and an annual certification.4 The SSA grants access of the full dataset to approved state and federal agencies, whereas the Department of Commerce’s National Technical Information Service (NTIS) sells access to the public file to approved and certified private and public organizations.

The DMF Public File: A Diminishing Resource for Mortality Data

In 2011, the usefulness of the DMF public file as a comprehensive record of U.S. deaths changed dramatically, when the SSA ceased releasing state-level records as part of the file. For a decade, the DMF included state-reported deaths, but amid rising concerns that the file provided identity thieves easy access to personally identifiable information (PII), the SSA, citing the Social Security Act, determined it had been erroneously disclosing state records. Thus, the SSA removed 4.2 million historical death records, and ceased releasing state-reported death records in subsequent updates.5 In 2010, the SSA DMF file (with state records) included 2.5 million records6; by 2016, the file included only 460,000 records.

© Datavant | www.datavant.com

The impact of the SSA’s decision on the completeness of the DMF data file is clearly illustrated in Figure 1. In 2005, the DMF data file accounted for 90% of the deaths reported by the CDC, whereas in 2016, the DMF data file accounted for a mere 17% of the CDC-reported deaths. The net effect for healthcare data analysts is that the lack of reliable, timely mortality data can lead to billing and benefits errors, slow the collection of data for clinical trials or long-term epidemiological studies, and hamper outcomes analyses.

Figure 1. Annual U.S. Deaths (2005 – 2016): Reported Volume in DMF Data File vs. CDC NDI

Note: *CDC NDI values for 2015 and 2016 are estimates.

Leveraging U.S. Obituary Data to Increase Mortality Coverage

Given the continued decline in DMF coverage, we wanted to understand the extent to which obituary data might increase the mortality coverage beyond that of the DMF data file alone. To do so, we joined obituary data gathered since 2010 to the DMF file, removed any duplicate individuals shared between the two files, and counted the total number of unique records per year. By removing the duplicate records shared between the two files, we were able to assess the true additive effect of the obituary data set. We used the CDC’s NDI data as a benchmark for total U.S. deaths in a given year.

As shown in Figure 2, adding obituary data to the DMF data file results in a substantial increase in mortality data coverage. In 2016, the inclusion of obituary data added 1.7 million unique mortality records to the 460,000 included in the DMF public file. The combined data, shown in Figure 2, contain total mortality records of nearly 2.2 million individuals for 2016, or 82% of the total estimated lives in the CDC NDI. This study demonstrates that the addition of obituary data to the DMF data file is an effective and robust method of capturing the majority of U.S. mortality records, and is far superior to using the DMF data file alone.


90%82%

72%62% 59% 55% 51% 50% 45%

41%33%

17%

Figure 2. Filling the DMF Data Gap with Obituary Data

Note: *Percentage shown is the percentage of the total estimated CDC NDI death records in 2016 represented by the combined mortality data set. **Obituary data (De-duplicated): Mortality records present in both the DMF data file and in the Obituary records were removed from the Obituary data.

Incorporating Mortality Data in Healthcare Environments: Compliance Considerations

Healthcare organizations must abide by the security and privacy regulations set forth in the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and in the subsequent Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009. To comply with government regulations, all PHI data elements must be removed or the records de-identified, i.e. “anonymized”, before being incorporated into any other healthcare data set that is intended for use other than direct clinical support.

Incorporating mortality data into existing healthcare data sets presents notable security and compliance challenges. Public and privately-available mortality data include a wealth of PHI and PII (e.g. names, addresses, dates of birth and death, social security numbers, etc.); therefore, adding mortality data directly into a healthcare data set could be a HIPAA violation.

To be HIPAA-compliant, mortality data should be de-identified before being brought into a healthcare environment. However, straight de-identification presents an obvious challenge: once PHI/PII data elements are removed, users have no way to understand which deceased individuals in the mortality data set match the de-identified individuals in the destination healthcare data set.


0

5,00,000

10,00,000

15,00,000

20,00,000

25,00,000

30,00,000

2010 2011 2012 2013 2014 2015 2016

Num

ber

of U

.S. D

eath

Rec

ords

DMF Data File Obituary Data (De-duplicated)**

82%


For a detailed description of our de-identification/linking process, please see our whitepaper:

John Smith Sally Jones Sumita Raja

etc

AAA0001BBB0002CCC0505

etc

ZZZ4040GHGH111AAA0001

THHT2200YYY4848VVV2222

MMM7777etc

“Raw” Mortality DataTokenized

Mortality DataTokenized

Healthcare Data

De-identification and Tokenization

Matching patient records

without PHI

Adding mortality data in HIPAA-

compliant manner

Deceased on June 2, 2015

1 2 3

Overview of Datavant's De-Identification and Linking

Technology for Structured Data

HIPAA-Compliant De-Identification and Linking of Mortality Data Using Patient Tokens

To make our combined mortality data set HIPAA-compliant, we de-identified the records using Datavant technology. This technology processes the raw data from the de-duplicated DMF and obituary data file, removing names, addresses, and social security numbers. It further modifies the date of birth field to contain only the year of birth. Through these changes, the resulting mortality data set is statistically de-identified as dictated by the expert determination method allowed by HIPAA.

To make the de-identified mortality still l inkable to o ther data sets even with PHI / P II removed, we took advantage of the ability of Datavant's de-identification engine to add a unique, encrypted patient token to each record as it performs the de-identification. A token is an irreversible 44-character long string of characters built from a patient’s PHI that serves as unique ID for the person, and is reliably reproduced for the same patient in any data set where the software is run.

The resulting de-identified, tokenized mortality data set can now be joined to any similarly de-identified and tokenized healthcare data set in a HIPAA-compliant manner, because using this method, no PHI / PII is brought into the healthcare environment. The healthcare analyst simply has to match the tokens in the mortality data set to the tokens in their healthcare data to determine if any of their patients should be marked as deceased (Figure 3).

Figure 3. HIPAA-Compliant Joining of Mortality and Healthcare Data Sets: Patient De-identification, Token Creation, and Linking of Data Sets

Patient records in the healthcare data set(s) can now be flagged/updated with death information, while ensuring that users are never exposed to PHI (nor HIPAA violations).

Characterizing Patient Tokens to Enhance Data Certainty

The incorporation of mortality data into healthcare data sets comes with two certainty expectations: that the de-identified deceased patients are accurately matched across the data sets, and that patients in the mortality data set are actually deceased. To address these expectations, Datavant developed a method of characterizing the tokenized records in the mortality data. For each mortality record, Datavant includes both a uniqueness score, which indicates the likelihood that a unique encrypted token for a given mortality record designates a unique person (i.e. cannot be confused with another individual), and a death validity score, which indicates a confidence value that an individual associated with a given mortality data set is actually deceased.

The Uniqueness Score

The Uniqueness Score characterizes the encrypted token to help users understand the likelihood that the token designates a truly unique individual. For example, a token scheme may use PHI attributes of name, gender, and birthday to create unique encrypted tokens. Ideally, a single token would correspond to a single individual every time, but a separate study we have performed shows that 1% of tokens are not unique (because they are generated from non-unique PHI, e.g. multiple men named John Smith and born on the same date.) These non-unique tokens can generate multiple matches (i.e. false positives) and could result in inaccurate linking when the mortality data is merged with other datasets. (For the detailed analysis, please see, “Matching Accuracy of Patient Tokens in De-Identified Health Data Sets: A False Positive Analysis” at www.datavant.com).

To understand the potential false positive rate of each token within the mortality data set, Datavant took the token set for each record and determined the number of times the same tokens were present in other records. 99% of records in the mortality data set had a unique token, while the remaining 1% of records had tokens that were non-unique. By counting the number of records sharing a token, we ascribed a Uniqueness Score to the token in each record. This score represents the probability that a match with that token is accurate. For example, if a token is unique, then the Uniqueness Score we ascribe is 100%, meaning that any match with this token is likely to be accurate. Alternatively, a token set that is shared by four different individuals in the mortality data set is given a Uniqueness Score of 25%, meaning that any match with this token to the same token in a healthcare data set only has a 1 in 4 chance of being an accurate match of the correct individual.

The Death Validity Score

To address the expectation that mortality records accurately reflect deceased patients, Datavant characterizes the certainty that the individual is deceased by adding a Death Validity Score to the mortality data set. While this score can be made into a quantitative value, we did not feel that the depth of the data we had access to warranted that type of false precision. Therefore,


we opted to characterize the mortality records with a qualitative field.

The Death Validity characterization we created is based on a set of logical rules. The first rule we apply is to determine if an individual in the mortality data set is flagged in the original DMF data with a “Proof” flag (meaning the SSA has received a death certificate) or a “Verified” flag (meaning a family member has verified the death). The records are flagged in the mortality data as having a higher Death Validity Score. The second rule looks at the remaining un-flagged records to determine which individuals are flagged as deceased in both the original DMF data file and the obituary data file. Any records present in both of the underlying mortality data sets are also flagged as having a higher Death Validity Score. In this way, users of the mortality data set can differentiate between individuals who are more likely to be deceased from those who are not.

Study Conclusions

Based on this study, we conclude that mortality data can be joined with healthcare data in a HIPAA-compliant manner. We were successful in combining obituary data with SSA DMF data to create a much more robust and representative mortality data set than using either source alone. By de-identifying and tokenizing this data set, we can merge it with healthcare data without bringing PHI into the data environment, thus avoiding a HIPAA violation. To join de-identified mortality records to de-identified healthcare records, we show that unique encrypted patient tokens can be used to generate accurate matches in 99% of cases. We demonstrate that we can identify the 1% of records for whom matching should be viewed with some trepidation through the addition of a Uniqueness Score. We likewise illustrate that we can characterize the strength of the mortality data using a qualitative scoring system, which we call a Death Validity Score.

Having proven this method for adding mortality data to healthcare data in a de-identified and HIPAA-compliant manner, Datavant is making this mortality data set available to external parties. The Datavant mortality data module is updated weekly as new data is added from SSA DMF updates and new obituary data records, and is sent to Datavant customers with records tokenized in their site-specific key. In this way, users of Datavant's de-identification and linking engines can easily and continuously add mortality data to their data sets for fraud prevention, outcomes research, and other use cases.

Sources

1. www.ncbi.nlm.nih.gov/nlmcatalog/1010214392. www.ssdmf.com3. https://www.ssa.gov/dataexchange/request_dmf.html4. https://classic.ntis.gov/products/ssa-dmf/#5. http://oig.ssa.gov/newsroom/congressional-testimony/hearing-social-securitys-death-records6. http://www.nytimes.com/2012/10/09/us/social-security-death-record-limits-hinder-researchers.html?_r=1&



For more information:

Contact Jason LaBonte, Ph.D. for questions or comments about this analysis: [email protected]

Contact Lauren Stahl for more information about the Datavant modules that were used in this study: [email protected]

Visit the Datavant website to read our other whitepapers and materials: www.datavant.com

Organizing the World's Health Data

Datavant helps organizations safely share and link healthcare data.

We believe in connecting healthcare data to eliminate the silos of healthcare information that hold back innovative medical research and improved patient care. We help data owners manage the privacy, security, compliance, and trust required to enable safe data sharing.

Datavant's vision is backed by Roivant Sciences, Softbank, and Founders Fund, and combines technical leadership and healthcare expertise. Datavant is located in the heart of San Francisco's Financial District.

Glossary of Terms:

Covered EntityA covered entity (CE) under HIPAA is a health care provider (e.g. doctors, dentists, pharmacies, etc), a health plan (e.g. private insurance, government programs like Medicare, etc), or a health care clearinghouse (i.e. entities that process and transmit healthcare information).

De-identified health dataDe-identified health data is data that has had PII removed. Per the HIPAA Privacy Rule, healthcare data not in use for clinical support must have all information that can identify a patient removed before use. This rule offers two paths to compliantly remove this information: the Safe Harbor method and the Statistical method. When these identifying elements have been removed, the resulting de-identified health data set can be used without restriction or disclosure.

Deterministic matchingDeterministic matching is when fields in two data sets are matched using a unique value. In practice, this value can be a social security number, Medicare Beneficiary ID, or any other value that is known to only correspond to a single entity. Deterministic matching has higher accuracy rates than probabilistic matching, but is not perfect due to data entry errors (mis-typing a social security number such that matching on that field actually matches two different individuals).

Encrypted patient tokenEncrypted patient tokens are non-reversible 44 character strings created from a patient’s PHI, allowing a patient’s records to be matched across different de-identified health data sets without exposure of the original PHI.

False positiveA false positive is a result that incorrectly states that a test condition is positive. In the case of matching patient records between data sets, a false positive is the condition where a “match” of two records does not actually represent records for the same patient. False positives are more common in probabilistic matching than in deterministic matching.

Fuzzy matchingFuzzy matching is the process of finding values that match approximately rather than exactly. In the case of matching PHI, fuzzy matching can include matching on different variants of a name (Jamie, Jim, and Jimmy all being allowed as a match for “James”). To facilitate fuzzy matching, algorithms like SOUNDEX can allow for differently spelled character strings to generate the same output value.

Health Information Technology for Economic and Clinical Health (HITECH) Act The HITECH Act was passed as part of the as part of the American Recovery and Reinvestment Act of 2009 (ARRA) economic stimulus bill. HITECH was designed to accelerate the adoption of electronic medical records (EMR) through the use of financial incentives for “meaningful use” of EMRs until 2015,


and financial penalties for failure to do so thereafter. HITECH added important security regulations and data breach liability rules that built on the rules laid out in HIPAA.

Health Insurance Portability and Accountability Act of 1996 (HIPAA) HIPAA is a U.S. law requiring the U.S. Department of Health and Human Services (HHS) to develop security and privacy regulations for protected health information. Prior to HIPAA, no such standards existed in the industry. HHS created the HIPAA Privacy Rule and HIPAA Security Rule to fulfill their obligation, and the Office for C ivil R ights (OCR) within HHS has the responsibility of enforcing these rules.

Personally-identifiable information (PII)Personally-identifiable information (PII) is a general term in information and security laws describing any information that allows an individual to be identified either directly or indirectly. PII is a U.S.-centric abbreviation, but is generally equivalent to “personal information” and similar terms outside the United States. PII can consist as informational elements like name, address, social security number or other identifying number or code, telephone number, email address, etc., but can include non-specific data elements such as gender, race, birth date, geographic indicator, etc. that together can still allow indirect identification of an individual.

Probabilistic matchingProbabilistic matching is when fields in two data sets are matched using values that are known not to be unique, but the combination of values gives a high probability that the correct entity is matched. In practice, names, birth dates, and other identifying but non-unique values can be used (often in combination) to facilitate probabilistic matching.

Protected health information (PHI)Protected health information (PHI) refers to information that includes health status, health care (physician visits, prescriptions, procedures, etc.), or payment for that care and can be linked to an individual. Under U.S. law, PHI is information that is specifically created or collected by a covered entity.

Safe Harbor de-identificationHIPAA guidelines requiring the removal of identifying information offered covered entities a simple, compliant path to satisfying the HIPAA Privacy Rule through the Safe Harbor method. The Safe Harbor de-identification method is to remove any data element that falls within 18 different categories of information, including:

1. Names2. All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP

code, and their equivalent geocodes. However, you do not have to remove the first three digits of the ZIP code if there are more than 20,000 people living in that ZIP code.

3. The day and month of dates that are directly related to an individual, including birth date, date of admission and discharge, and date of death. If the patient is over age 89, you must also remove his age and the year of his birth date.


4. Telephone number5. Fax number6. Email addresses7. Social Security number8. Medical record number9. Health plan beneficiary number10. Account number11. Certificate or license number12. Vehicle identifiers and serial numbers, including license plate numbers13. Device identifiers and serial numbers14. Web addresses (URLs)15. Internet Protocol (IP) addresses16. Biometric identifiers, such as fingerprints17. Full-face photographs or comparable images18. Any other unique identifying number, such as a clinical trial number

Social Security Death Master FileThe U.S. Social Security Administration maintains a file of over 86 million records of deaths collected from social security payments, but it is not a complete compilation of deaths in the United States. In recent years, multiple states have opted out of contributing their information to the Death Master File and its level of completeness has declined substantially. This Death Master File has limited access, and users must be certified to receive it. This file contains PHI elements like social security numbers, names, and dates of birth – therefore, bringing the raw data into a healthcare data environment could risk a HIPAA violation.

SoundexSoundex is a phonetic algorithm that codes similarly sounding names (in English) as a consistent value. Soundex is commonly used when matching surnames across data sets as variations in spelling are common in data entry. Each soundex code generated from an input text string has 4 characters – the first letter of the name, and then 3 digits generated from the remaining characters, with similar-sounding phonetic elements coded the same (e.g. D and T are both coded as a 3, M and N are both coded as a 5).

Statistical de-identification (also known as Expert Determination)Because the HIPAA Safe Harbor de-identification method removes all identifying elements, the resulting de-identified health data set is often stripped of substantial analytical value. Therefore, statistical de-identification is used instead (HIPAA calls this pathway to compliance “Expert Determination”). In this method, a statistician or HIPAA certification professional certifies that enough identifying data elements have been removed from the health data set that there is a “very small risk” that a recipient could identify an individual. Statistical de-identification often allows dates of service to remain in de-identified data sets, which are critical for the analysis of a patient’s journey, for determining an episode of care, and other common healthcare investigations.


mortality data in healthcare analytics€¦ · file accounted for a mere 17% of the cdc-reported...

Documents